---
name: dataviz
description: >
  Expert skill for creating data visualizations — charts, graphs, dashboards, infographics,
  interactive plots, and inline SVG/HTML widgets. Use this skill whenever the user asks to
  visualize data, create a chart or graph, plot a dataset, build a dashboard, explore data
  visually, or asks "how should I display this data?". Also trigger for tool/library
  selection questions (ggplot2, seaborn, altair, matplotlib, Vega-Lite, D3.js, Plotly,
  Bokeh, deck.gl), questions about visual encoding, chart type choice, color palettes,
  Gestalt principles, perception/cognition, chart critique, or fixing a misleading
  chart. Even if the user just pastes a CSV, a table, or a messy dataset and asks
  "what can I do with this?" — use this skill. Covers code-based output (Python, R, JS)
  and inline SVG/HTML artifacts, plus evaluation and storytelling for the final product.
---

# Data Visualization Skill

A comprehensive skill for transforming data into effective, accurate, and beautiful
visualizations. Covers task framing, chart selection, visual encoding, tool
implementation, design principles, storytelling, and evaluation — grounded in the
research tradition of Tufte, Bertin, Mackinlay, Cleveland & McGill, Shneiderman,
Franconeri, Wilke, Knaflic, Schwabish, and Murray.

---

## 1. Start with the Task, Not the Chart

Before writing a single line of code or picking a chart type, work through the
"Why / What / How" triad (the framing used across modern viz pedagogy):

1. **Why** — What is the intent? Exploration, explanation, decision support, or art/communication?
2. **What** — What are the data types? Categorical, ordinal, quantitative, sequential, diverging, temporal, geospatial?
3. **How** — What design choices encode, manipulate, facet, or reduce the data? How will the viz be built and read?

Then narrow down:

- **Audience** — experts/analysts or general public? Adjust complexity and labeling accordingly.
- **Question** — comparison, distribution, correlation, trend, part-to-whole, flow, hierarchy, geography?
- **Output format** — inline SVG/HTML artifact, Python/R code, static image, or interactive dashboard?
- **Tool** — if not specified, infer from context (see §3).

If the user says "visualize this" and pastes data, make a principled choice, show it,
and briefly state the reasoning — don't over-ask.

> **Data transformations shape the message.** Dropping vs. imputing missing values,
> aggregating vs. showing raw points, and absolute counts vs. rates per capita are all
> *design decisions* that change what the chart says. Flag these choices; they are not
> neutral.
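
As a small illustration of how these choices change the message, the sketch below uses made-up numbers (the region names, counts, populations, and readings are all hypothetical). It shows that ranking by raw count and ranking by rate per 100,000 can disagree, and that dropping a missing value versus filling it with zero shifts the mean.

```python
from statistics import mean

# Hypothetical data: case counts and populations for three regions.
regions = {
    "Metro": {"cases": 900, "population": 3_000_000},
    "Valley": {"cases": 300, "population": 400_000},
    "Coast": {"cases": 120, "population": 100_000},
}


def rate_per_100k(cases, population):
    """Cases per 100,000 inhabitants."""
    return cases / population * 100_000


# Same data, two rankings: raw counts vs. population-adjusted rates.
by_count = sorted(regions, key=lambda r: regions[r]["cases"], reverse=True)
by_rate = sorted(regions, key=lambda r: rate_per_100k(**regions[r]), reverse=True)
print("Ranked by count:", by_count)
print("Ranked by rate: ", by_rate)

# Hypothetical readings with one missing value.
readings = [12.0, None, 15.0, 9.0]
dropped = [v for v in readings if v is not None]
zero_filled = [0.0 if v is None else v for v in readings]
print("Mean after dropping:    ", mean(dropped))
print("Mean after zero-filling:", mean(zero_filled))
```

The two rankings come out reversed here, which is exactly why the choice should be stated next to the chart rather than made silently.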

---

## 2. Chart Selection Guide

Match the **data relationship** to the right chart family. This table reflects the
consensus across Schwabish, Wilke, and data-to-viz.com.

| Relationship | Best Charts | Avoid |
|---|---|---|
| Comparison (categories) | Bar, Dot plot, Lollipop | Pie (>4 slices), 3D, radar |
| Comparison (over time) | Line, Area, Slope (2 points) | Bar for long time series |
| Distribution (1 group) | Histogram, Density, Strip plot | Pie |
| Distribution (many groups) | Violin, Box plot, Ridgeline, Beeswarm | Overlapping histograms |
| Correlation | Scatter, Bubble, Hex-bin, Heatmap | Line chart |
| Part-to-whole | Stacked bar, Treemap, Waffle, Donut (≤5) | 3D pie, exploded pie |
| Ranking | Sorted bar, Slope, Dumbbell | Unsorted bar |
| Trend | Line, Smoothed line, Sparkline | Bar for continuous x |
| Geospatial | Choropleth (rates!), Dot map, Proportional symbol, Cartogram | 3D maps, raw counts on choropleth |
| Flow / Process | Sankey, Alluvial, Chord | Overloaded network diagrams |
| Hierarchy | Treemap, Sunburst, Icicle | Deep nested pies |
| Multivariate | Small multiples, Parallel coordinates, Scatter-matrix | Single chart with 6+ encodings |
| Uncertainty | Error bars, Confidence bands, Gradient plots, HOPs | Hiding the uncertainty |

**Quick decision heuristic:**
- 1 variable → histogram / density / bar
- 2 vars (cat + num) → bar, box, or strip plot
- 2 vars (num + num) → scatter
- 3+ vars → color/size/shape encoding **or** small multiples (prefer the latter once categories > 3)
- Time + numeric → line
- Geography + numeric → choropleth of *rates*, not counts
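
The heuristic above can be written down as a small lookup function. This is only a sketch: `suggest_chart`, its type labels (`"num"`, `"cat"`, `"time"`, `"geo"`), and the `n_categories` parameter are hypothetical names introduced for illustration, and the split of the one-variable rule into categorical vs. numeric is an assumption.

```python
def suggest_chart(var_types, n_categories=0):
    """Suggest a chart family from a list of variable types.

    var_types: list of "num", "cat", "time", or "geo" (hypothetical labels).
    n_categories: number of distinct categories, used for the 3+ variable rule.
    """
    kinds = sorted(var_types)
    if len(kinds) == 1:
        # Assumption: a lone categorical variable is shown as a bar of counts.
        return "bar" if kinds == ["cat"] else "histogram / density"
    if len(kinds) == 2:
        if "time" in kinds and "num" in kinds:
            return "line"
        if "geo" in kinds and "num" in kinds:
            return "choropleth of rates"
        if kinds == ["cat", "num"]:
            return "bar, box, or strip plot"
        if kinds == ["num", "num"]:
            return "scatter"
    if len(kinds) >= 3:
        # Prefer small multiples once there are more than 3 categories.
        if n_categories > 3:
            return "small multiples"
        return "color/size/shape encoding"
    return "no rule matched; fall back to the chart selection table"


print(suggest_chart(["num", "cat"]))
print(suggest_chart(["time", "num"]))
print(suggest_chart(["num", "num", "cat"], n_categories=5))
```

A function like this is a starting point for the conversation, not a replacement for the table: audience and question still decide the final choice.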

For deeper guidance including less common charts (ridgeline, waffle, Sankey, cartogram),
load `references/chart_types.md`.

> **Schwabish's rule of thumb:** *"It's not about showing the least amount of data,
> it's about showing the data that matter most."*

---

## 3. Tool & Library Selection

| Context | Recommended | When |
|---|---|---|
| Python, fast exploratory | **seaborn** | pandas-native, built-in `colorblind` palette, minimal code |
| Python, Grammar of Graphics | **altair** / Vega-Lite | Declarative, interactive, JSON spec |
| Python, full control | **matplotlib** | Publication figures, custom layouts |
| Python, interactive web | **plotly**, **bokeh** | Hover, zoom, dashboards |
| R, grammar of graphics | **ggplot2** (+ **ggdist**, **ggalluvial**) | Layered, elegant, publication-ready |
| Inline SVG/HTML in chat | **plain SVG** or **Vega-Lite JSON** | Quick visual for the conversation |
| Web, fully custom | **D3.js** (Murray 2017 reference) | Unique interactive work; steep curve |
| Web, geo | **deck.gl**, **Leaflet** + D3 | Large or 3D geographic data |
| Dashboards | **Plotly Dash** (Py), **Shiny** (R), **Streamlit** | Multi-chart, interactive apps |
| Static/no-code | SVG artifact via `show_widget` | Quick visual for chat |

**Default behavior:** if the user hasn't specified a language and just wants to see
something → produce an **inline SVG or Vega-Lite artifact** using the `show_widget`
tool. If the user is coding in Python → start with seaborn unless they asked for
interactivity.

---

## 4. Visual Encoding: Grammar of Graphics

Every visualization maps **data attributes** to **visual channels**. This is the
Grammar of Graphics (Wilkinson; popularized by Wickham via ggplot2, also underlying
Altair and Vega-Lite).

### The seven components

| Component | Role | Example |
|---|---|---|
| **Data** | The raw table to plot | `df` |
| **Aesthetics (aes)** | Mappings: which column → which channel | `x = date, y = temp, color = season` |
| **Scales** | How data values translate to visual values | `date → pixel x; temp → pixel y` |
| **Geometries (geom)** | The mark type | points, lines, bars, areas |
| **Statistics (stat)** | Transformations before plotting | binning, smoothing, quantiles |
| **Coordinates** | Coordinate system | Cartesian, polar, geographic |
| **Facets** | Splitting into small multiples | `facet_wrap(~ region)` |

This is why ggplot2 chains with `+` and Altair chains with `.encode(...).properties(...)`:
each step adds one layer of the grammar. Build visualizations *incrementally*, one
component at a time, rather than reaching for a fixed chart template.
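
To make the incremental idea concrete, here is a minimal sketch that assembles a Vega-Lite spec as a plain Python dictionary, one grammar component at a time. The data values and field names (`date`, `temp`, `region`) are invented for the example; the structure follows the Vega-Lite JSON format used elsewhere in this skill.

```python
import json

# Start from an empty spec and add one grammar component per step.
spec = {"$schema": "https://vega.github.io/schema/vega-lite/v5.json"}

# Data: hypothetical temperature readings for two regions.
spec["data"] = {
    "values": [
        {"date": "2024-01-01", "temp": 3.1, "region": "north"},
        {"date": "2024-02-01", "temp": 4.0, "region": "north"},
        {"date": "2024-01-01", "temp": 8.2, "region": "south"},
        {"date": "2024-02-01", "temp": 9.5, "region": "south"},
    ]
}

# Geometry: the mark type.
spec["mark"] = "line"

# Aesthetics: which column maps to which channel.
spec["encoding"] = {
    "x": {"field": "date", "type": "temporal"},
    "y": {"field": "temp", "type": "quantitative"},
}

# Scales: do not force the y-axis to include zero (acceptable for a line chart).
spec["encoding"]["y"]["scale"] = {"zero": False}

# Facets: one small multiple per region.
spec["encoding"]["facet"] = {"field": "region", "type": "nominal", "columns": 2}

print(json.dumps(spec, indent=2))
```

Each assignment can be added, removed, or changed independently, which is the practical benefit of thinking in grammar components rather than chart templates.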

### Channel effectiveness ranking

Based on Cleveland & McGill (1984), refined by Mackinlay (1986) and Franconeri et al.
(2021). For **quantitative** data, from highest precision to lowest:

**Position → Length → Angle / slope → Area → Volume → Intensity (color saturation / value) → Color hue**

For **categorical** data: Position → Color hue → Shape → Texture.

**Consequences:**
- Encode the **most important variable** with position (x or y).
- Bars beat pies because length beats angle.
- Bubble size should scale by **area**, not radius: `r = √(value / π)`. Our eyes
  underestimate large circles regardless, so add a size legend.
- Use color only when position is already taken, or to highlight a subset.
- Don't stack multiple encodings on the same variable unless you have an accessibility reason.
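
The area-versus-radius point is easy to get wrong in code, so here is a short worked sketch. `bubble_radius` and `max_radius` are hypothetical names; the function rescales the `r ∝ √value` rule so the largest value gets a chosen pixel radius, and compares it with the naive radius-proportional mapping.

```python
import math


def bubble_radius(value, max_value, max_radius=40.0):
    """Radius chosen so that circle AREA is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


def circle_area(radius):
    return math.pi * radius ** 2


values = [10, 40, 90]  # hypothetical data
largest = max(values)

by_area = [bubble_radius(v, largest) for v in values]
by_radius = [40.0 * v / largest for v in values]  # naive: radius proportional to value

# The value 40 is 4x the value 10, so its circle should have 4x the area.
print("Area ratio, area scaling:  ", circle_area(by_area[1]) / circle_area(by_area[0]))
print("Area ratio, radius scaling:", circle_area(by_radius[1]) / circle_area(by_radius[0]))
```

With area scaling the ratio is 4, matching the data; with radius scaling it is 16, which exaggerates the difference fourfold.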

### Bertin's visual variables (1967)

Jacques Bertin's foundational list — still useful as a mental checklist of what you
*could* encode with: **position, size, shape, value (luminosity), color (hue and
saturation), orientation, texture**. Match the variable to the *structure* of your data
(nominal, ordinal, quantitative).

### Schemas & expectations (Franconeri et al., 2021)

Viewers rely on learned metaphors. Honor them:
- **Up = more**, down = less
- **Left-to-right = time** (in left-reading cultures)
- **Dark = important / more** (for sequential scales)
- **Red = bad / hot / negative**, blue = good / cool / positive
- **Green = good / natural**, yellow = caution

Violating these schemas is occasionally justified (e.g., keeping a brand palette), but
it *costs* the reader cognitive effort. Most of the time, go with the expectation.

---

## 5. Perception & Cognitive Load

The visual system is powerful but fallible — it relies on statistical regularities and
has hard limits.

### Preattentive attributes

Certain features pop out in parallel, before focused attention kicks in: **color, size,
shape, orientation, enclosure, line width, motion, position**. Use at most one or two
of these to direct attention to the key number or line. A highlighted value in a
gray-on-gray table jumps out; the same number in a rainbow table gets lost.

### Gestalt principles (apply to layout and grouping)

- **Proximity** — group related elements by placing them close together.
- **Similarity** — same color/shape/size implies same category.
- **Enclosure** — a border or background groups elements more strongly than proximity or color.
- **Continuity** — the eye follows smooth lines and aligned edges; align everything on a grid.
- **Closure** — the mind completes broken shapes; leverage this for minimalist marks.
- **Connection** — connected elements are perceived as one unit (stronger than similarity).
- **Figure & ground** — the viewer separates foreground from background; make sure the *data* is figure.
- **Symmetry** — symmetric arrangements are perceived as more stable and grouped.

### Working memory limits

Viewers can compare only ~2–3 items per second, and working memory holds about 4
"chunks." Concrete consequences (Franconeri et al., 2021):

- A 20-bar chart costs ~10 seconds of active comparison work.
- Legends force an eye–memory shift every time the reader looks up a color → **direct-label instead**.
- Highlight or annotate the key differences for the reader instead of making them find them.
- Chunk and summarize rather than showing every datum, unless the dots *are* the point.
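
The 20-bar figure above follows from simple arithmetic. The sketch below treats comparison cost as linear in the number of items, which is a simplifying assumption for a back-of-envelope estimate, not a perceptual model; `comparison_seconds` is a hypothetical helper.

```python
def comparison_seconds(n_items, items_per_second=2.0):
    """Rough time to scan n_items at a given comparison rate (linear assumption)."""
    return n_items / items_per_second


# At the slow end of the 2-3 items/second range, 20 bars take about 10 seconds.
print(comparison_seconds(20, items_per_second=2.0))
# At the fast end, the same chart still takes several seconds.
print(round(comparison_seconds(20, items_per_second=3.0), 1))
```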

### Key takeaways for design

1. Use high-precision channels (position, length) for the most important comparison.
2. Leverage Gestalt and preattentive features deliberately — don't wallpaper with color.
3. Minimize cognitive load: direct labels, pre-computed comparisons, restrained palettes.
4. Design ethically: show patterns, not persuasion.

---

## 6. Five Guidelines for Better Data Visualizations (Schwabish)

A compact checklist every chart should pass:

1. **Show the data.** Let the data be the star. This doesn't mean show *everything* —
   show what *matters*. Highlighting 6 countries in gray-vs-color is more
   informative than a rainbow of 30.
2. **Reduce clutter.** Remove heavy tick marks, borders, 3D effects, redundant gridlines,
   texture fills, overlapping markers. Every pixel that isn't data is a tax on attention.
3. **Integrate graphics and text.** Active titles ("Sales rose 40% in Q3") beat
   descriptive titles ("Sales by Quarter"). Annotations inside the plot area beat
   captions below it. Tell the reader what to see.
4. **Avoid the spaghetti chart.** When you have many series, either (a) highlight the
   ones that matter and gray out the rest, or (b) break into small multiples.
5. **Start with gray.** Default every mark to gray, then add color only where needed
   to direct attention. Color is a scarce resource, not a default.
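
Guidelines 4 and 5 combine naturally into one small helper: give every series a neutral gray and reserve a single accent color for the series you want the reader to see. This is a sketch; `highlight_colors` and the two hex values are illustrative choices, not a prescribed palette.

```python
GRAY = "#b0b0b0"    # context color (illustrative)
ACCENT = "#d95f02"  # single highlight color (illustrative)


def highlight_colors(series_names, highlighted, accent=ACCENT, base=GRAY):
    """Map each series to gray, except the highlighted ones, which get the accent."""
    focus = set(highlighted)
    return {name: (accent if name in focus else base) for name in series_names}


countries = ["Germany", "France", "Italy", "Spain", "Poland"]  # hypothetical series
colors = highlight_colors(countries, highlighted=["Germany"])
for name, color in colors.items():
    print(f"{name:8s} {color}")
```

The resulting dictionary can be passed to most plotting libraries as a per-series color mapping, for example as a `palette` dictionary in seaborn.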

---

## 7. Color: Using It Well

Color is the encoding most often misused. Three palette families, each for a different
data shape:

- **Qualitative (categorical):** ColorBrewer Set2, Tableau 10, seaborn `colorblind`, HUSL.
  Keep to ≤ 8 distinct colors; beyond that, use shape/position instead.
- **Sequential (ordered):** a light → dark gradient, either single-hue (Blues, Greens, Oranges)
  or multi-hue (Viridis, Plasma; perceptually uniform and colorblind-safe).
- **Diverging (midpoint):** two-hue gradient through neutral (RdBu, PuOr, BrBG).
  Use only when there is a *meaningful zero* (change, difference, deviation).

### Rules

- Always test for colorblind accessibility. Roughly 8% of men (and about 0.5% of women) have
  some form of red-green color vision deficiency, most often the deutan (green-weak) type.
  **Avoid pure red/green pairs.**
- Never use rainbow / jet for quantitative data — it is not perceptually uniform and
  creates false boundaries.
- For highlighting, start gray and use one saturated color for the series of interest.
- Darker/saturated = more important. Muted/gray = context.
- Keep brand colors for accents, not the entire data layer.

### Accessibility

- Minimum contrast under WCAG AA: 4.5:1 for normal text, 3:1 for large text and graphical elements.
- Encode category with shape *and* color when possible.
- Test with a colorblind simulator (Coblis, Viz Palette).
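
The contrast thresholds above can be checked programmatically. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas for six-digit hex colors; the function names are hypothetical, and the example colors are chosen only to show one pass and one fail.

```python
def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#rrggbb' color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors, from 1:1 up to 21:1."""
    l1, l2 = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


for fg in ("#000000", "#767676", "#b0b0b0"):
    ratio = contrast_ratio(fg, "#ffffff")
    verdict = "passes" if ratio >= 4.5 else "fails"
    print(f"{fg} on white: {ratio:.2f}:1 ({verdict} AA for normal text)")
```

Note that a light context gray such as `#b0b0b0` is fine for de-emphasized marks but falls below both thresholds, so it should not carry text the reader must be able to read.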

For palette hex codes and library syntax, load `references/color_palettes.md`.

---

## 8. Avoid Misleading Practices

- **Proportional ink** (Bergstrom & West): the amount of ink used to represent a value
  must be proportional to the value. Trimmed-but-unlabeled bar bases, pictogram
  cheating, and 3D depth all violate this.
- **No truncated axes on bar charts.** Bars encode *length*, so a broken baseline lies
  about ratios. Line charts may break the y-axis if justified and clearly labeled.
- **Dual y-axes** imply correlation that may not exist. Use only when intentional,
  labeled, and both axes are unambiguously tied to specific series.
- **3D charts distort perception** (tilt angle bias); avoid them except for genuinely
  3D data.
- **Choropleths of raw counts** overweight large/populous regions — use rates per capita
  or per area.
- **Cherry-picked time ranges / y-ranges** inflate or hide trends — always ask whether
  the range tells the honest story.
- **Tufte's Lie Factor** = (size of effect shown in graphic) / (size of effect in data).
  Aim for ~1.
- **Show uncertainty** when it exists — confidence intervals, error bars, gradient
  plots, or hypothetical outcome plots. Silently hiding uncertainty is a lie of omission.
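
The Lie Factor and the truncated-baseline rule are two views of the same problem, which a few lines of arithmetic make visible. The numbers below are hypothetical: a value rises from 100 to 110, and the bars are drawn once from a zero baseline and once from a baseline of 90. `lie_factor` is an illustrative helper that compares relative changes.

```python
def relative_change(start, end):
    return (end - start) / start


def lie_factor(data_start, data_end, drawn_start, drawn_end):
    """Tufte's Lie Factor: effect shown in the graphic / effect in the data."""
    return relative_change(drawn_start, drawn_end) / relative_change(data_start, data_end)


data_start, data_end = 100, 110  # hypothetical values: a 10% increase

# Zero baseline: bar lengths equal the values themselves.
honest = lie_factor(data_start, data_end, data_start, data_end)

# Baseline truncated at 90: bar lengths are 10 and 20, so the bar doubles.
baseline = 90
truncated = lie_factor(data_start, data_end, data_start - baseline, data_end - baseline)

print(f"Lie Factor with zero baseline:      {honest:.1f}")
print(f"Lie Factor with baseline at {baseline}:     {truncated:.1f}")
```

A 10% increase drawn as a doubling of bar length gives a Lie Factor of 10, which is why bar charts need a zero baseline.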

---

## 9. Interaction & Storytelling

Modern visualization leans heavily on interaction for exploration and narrative for
explanation.

### Shneiderman's Visual Information-Seeking Mantra (1996)

> **Overview first, zoom and filter, then details on demand.**

The gold-standard pattern for any interactive viz: let readers see the whole, then
drill in. Tooltips, brushing, linked views, and progressive disclosure all serve this.

### Common interaction patterns

- **Zoom / pan** — for large or dense datasets.
- **Filter / subset** — slider, dropdown, or toggle to narrow the scope.
- **Highlight / brush** — select a region in one view, see it highlighted in others.
- **Tooltip** — details on demand for a single mark.
- **Linked views** — small-multiple dashboards with shared selection state.
- **Animated transitions** — help the eye follow changes between states.

### Storytelling structures (Knaflic)

1. **Understand the context** — audience, setting, desired action.
2. **Choose an appropriate display** (see §2).
3. **Eliminate clutter, but not too much** (see §6).
4. **Draw attention where you want it** — preattentive color, bold annotation.
5. **Think like a designer** — the viz must be easy for *this* audience.
6. **Tell a story** — try the **Martini-glass** structure: start with a guided narrative,
   end with free exploration (a pattern described by Segel & Heer, 2010). Lead with the insight; then show the evidence. Use the
   **Five Ws** (who, what, when, where, why).

### Exemplars to study (interactive storytelling)

- **NYT "Coronavirus in the U.S."** — live dashboard with overview → filter → detail.
- **NYT interactive pieces broadly** — annotation-first narrative, scrollytelling.
- **Pudding "Crowdsourcing the Definition of Punk"** — reader becomes data.
- **NZZ "So gefährlich ist Ihre Veloroute"** — map + personal route + risk encoding.
- **Information is Beautiful "Snake Oil"** — interactive bubble chart with evidence strength on y.

Use these as reference calibrations when a user asks for "something like the New York
Times" — the common thread is: one clear question, one clear answer, plenty of
annotation, restrained palette, and a generous amount of white space.

---

## 10. Implementation Workflow

### For code output (Python / R)

```
1. Load & inspect data (shape, dtypes, nulls, ranges)
2. Clean & transform (melt, pivot, aggregate, normalize) — flag the choices
3. Choose chart type (§2)
4. Write minimal working code with chosen library (§3)
5. Apply design principles (§5–§8): labels, restrained color, declutter
6. Add an active title + axis labels with units + source/data credit
7. Add alt-text for accessibility
```
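
Step 1 of the workflow above can be sketched with the standard library alone. The CSV content and the `profile` helper are hypothetical; with pandas available, `df.info()` and `df.describe()` cover the same ground with less code.

```python
import csv
import io

# Hypothetical raw data with two missing cells.
raw = "city,population,temp_c\nBern,134000,9.5\nBasel,,11.0\nChur,38000,\n"
rows = list(csv.DictReader(io.StringIO(raw)))


def profile(rows):
    """Per column: inferred type, number of missing cells, and numeric range."""
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        try:
            numbers = [float(v) for v in present]
            kind, value_range = "numeric", (min(numbers), max(numbers))
        except ValueError:
            # Non-numeric text (or a column with no values at all).
            kind, value_range = "text", None
        report[column] = {
            "type": kind,
            "missing": len(values) - len(present),
            "range": value_range,
        }
    return report


report = profile(rows)
print(f"shape: {len(rows)} rows x {len(rows[0])} columns")
for column, info in report.items():
    print(column, info)
```

The missing counts are the prompt for step 2: decide explicitly whether to drop, impute, or show the gaps, and say so.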

**Every chart ships with:**
- An **active title** ("Germany's schooling grew fastest"), not a descriptive one ("Years of schooling by country").
- Axis labels with **units**.
- Source / data credit when applicable.
- An **alt-text** description of the finding (for screen readers).

### For inline SVG / HTML artifacts

Use `show_widget` with clean SVG or HTML+JS. Follow the frontend-design skill for
styling. For complex chart-specific SVG patterns, load
`references/chart_types.md`.

Default to Vega-Lite JSON for anything beyond a very simple bar/line chart — it's
declarative, honors the Grammar of Graphics, and renders inline.
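
For the "very simple bar chart" case, the SVG can be generated as a plain string. The following is a minimal sketch, assuming non-negative values with a positive maximum; `svg_bar_chart`, its parameters, the default color, and the sample data are all hypothetical, and it says nothing about how `show_widget` expects its input.

```python
from xml.sax.saxutils import escape, quoteattr


def svg_bar_chart(data, title, width=320, height=200, pad=30, color="#4c72b0"):
    """Return an SVG string for (label, value) pairs, with bars on a zero baseline."""
    max_value = max(value for _, value in data)
    plot_height = height - 2 * pad
    band = (width - 2 * pad) / len(data)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}" '
        f'role="img" aria-label={quoteattr(title)}>',
        f'<text x="{pad}" y="{pad - 10}" font-size="12" font-weight="bold">'
        f'{escape(title)}</text>',
    ]
    for i, (label, value) in enumerate(data):
        bar_height = plot_height * value / max_value  # length proportional to value
        x = pad + i * band + band * 0.1
        y = height - pad - bar_height
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{band * 0.8:.1f}" '
            f'height="{bar_height:.1f}" fill="{color}"/>'
        )
        # Direct label under each bar instead of a legend.
        parts.append(
            f'<text x="{x + band * 0.4:.1f}" y="{height - pad + 14}" '
            f'font-size="10" text-anchor="middle">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = svg_bar_chart([("A", 10), ("B", 25), ("C", 50)], "C leads with 50 units")
print(svg)
```

The title doubles as the `aria-label`, so the active title also serves as a basic text alternative.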

### For D3.js (custom web viz)

D3 binds data to DOM/SVG elements and gives you full control. Key concepts (Murray 2017):

- **Selections** (`d3.select`, `d3.selectAll`) — the bridge between data and DOM.
- **Data joins** (`.data().enter().append()`, with `.merge()` and `.exit()`) — add / update / remove marks as data changes.
- **Scales** (`d3.scaleLinear`, `scaleBand`, `scaleTime`, `scaleQuantize`) — map data domain → pixel range.
- **Axes** (`d3.axisBottom(xScale)`) — auto-generate tick marks and labels from scales.
- **Transitions** (`.transition().duration(500)`) — animate between states.
- **Shapes** (`d3.line`, `d3.area`, `d3.arc`, `d3.pie`, `d3.stack`) — path generators.
- **Geo** (`d3.geoPath`, `d3.geoMercator`) — projections and cartography.

Use D3 when no off-the-shelf library meets the design; otherwise prefer Vega-Lite, Plotly, or Observable.
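
The scale concept is the one most worth internalizing, and it is independent of D3. Below is a Python analogue of a linear scale, written only to show the domain-to-range mapping; `scale_linear` is a hypothetical helper, not part of D3 or any Python library.

```python
def scale_linear(domain, range_):
    """Return a function mapping values in `domain` linearly onto `range_`."""
    d0, d1 = domain
    r0, r1 = range_

    def scale(value):
        t = (value - d0) / (d1 - d0)  # position within the domain, 0..1
        return r0 + t * (r1 - r0)

    return scale


# Data values 0..200 mapped to horizontal pixels 40..460.
x = scale_linear((0, 200), (40, 460))
print(x(0), x(100), x(200))

# SVG's y-axis grows downward, so the pixel range is reversed for "up = more".
y = scale_linear((0, 50), (260, 20))
print(y(0), y(50))
```

Reversing the range for the vertical axis is the same trick D3 examples use to keep larger values higher on screen.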

---

## 11. Code Templates

### Python – seaborn bar chart (colorblind palette, decluttered)

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes `df` is a pandas DataFrame with "category" and "value" columns.
sns.set_theme(style="whitegrid", palette="colorblind")
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="category", y="value", ax=ax)
ax.set_title("Active title stating the finding", fontsize=14, fontweight="bold")
ax.set_xlabel("Category")
ax.set_ylabel("Value (units)")
sns.despine()
plt.tight_layout()
plt.show()
```

### Python – altair scatter with tooltip

```python
import altair as alt

chart = (
    alt.Chart(df)
    .mark_point()
    .encode(
        x=alt.X("x_col:Q", title="X label (units)"),
        y=alt.Y("y_col:Q", title="Y label (units)"),
        color=alt.Color("category:N", scale=alt.Scale(scheme="tableau10")),
        tooltip=["x_col", "y_col", "category"],
    )
    .properties(title="Active title", width=500, height=350)
    .interactive()
)
chart
```

### R – ggplot2 line chart with confidence band

```r
library(ggplot2)

ggplot(df, aes(x = date, y = value, color = group, fill = group)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +  # Set2 avoids the red/green pair found in Set1
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Active title",
    x = "Date", y = "Value (units)",
    color = "Group", fill = "Group",
    caption = "Source: ..."
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")
```

### Vega-Lite – minimal bar chart (inline artifact)

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"values": []},
  "mark": "bar",
  "encoding": {
    "x": {"field": "category", "type": "nominal", "sort": "-y"},
    "y": {"field": "value", "type": "quantitative"},
    "tooltip": [{"field": "category"}, {"field": "value"}]
  },
  "title": "Active title"
}
```

### D3 – scale + axis skeleton

```js
const xScale = d3.scaleLinear()
  .domain([0, d3.max(data, d => d.value)])
  .range([padding, width - padding]);

const xAxis = d3.axisBottom(xScale).ticks(5);

svg.append("g")
  .attr("class", "axis")
  .attr("transform", `translate(0, ${height - padding})`)
  .call(xAxis);
```

For fuller syntax across ggplot2 / seaborn / altair / matplotlib / Vega-Lite / D3,
load `references/libraries.md`.

---

## 12. Evaluation: Does the Chart Work?

Don't ship a chart without evaluating it. Three complementary methods:

### (a) Heuristic evaluation (fast, no participants — Forsell & Johansson, 2010)

Run this 10-point checklist on the chart:

1. **Information coding** — does the visual encoding match the data?
2. **Minimal actions** — can the reader reach their goal with minimal effort?
3. **Flexibility** — multiple ways to achieve a goal (where relevant)?
4. **Orientation & help** — can users navigate and recover from mistakes?
5. **Spatial organization** — is the layout easy to read?
6. **Consistency** — are similar things presented similarly?
7. **Recognition over recall** — are options visible, or must users remember them?
8. **Prompting** — are available actions clear?
9. **Remove the extraneous** — is there unnecessary visual noise?
10. **Dataset reduction** — can users filter / simplify effectively?

### (b) User observation / think-aloud (Preece, Sharp & Rogers)

Ask representative users to perform real tasks while verbalizing their reasoning.
You will measure:
- **Performance** — accuracy, time, errors
- **Comprehension** — do they arrive at the intended insight?
- **Judgment** — do they trust the chart? Do they prefer it?

Methods include think-aloud, screen recording, structured interviews, post-task
questionnaires, and — for deeper work — eye-tracking.

### (c) Controlled experiments (summative)

A/B test two designs on a defined task. Measure accuracy and response time. Use a
standardized usability instrument (e.g., SUS) when relevant.
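
If SUS is used, the scoring is mechanical: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a 0 to 100 score. The sketch below applies that rule; the participant responses for the two designs are invented for illustration.

```python
from statistics import mean


def sus_score(responses):
    """Score one participant's ten SUS ratings (each 1..5) on a 0..100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten ratings between 1 and 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item_number % 2 == 1 else (5 - rating)
    return total * 2.5


# Hypothetical responses from two participants per design.
design_a = [[4, 2, 4, 2, 4, 2, 4, 2, 4, 2], [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]]
design_b = [[3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [4, 3, 3, 3, 4, 2, 3, 3, 3, 3]]

print("Design A mean SUS:", mean(sus_score(r) for r in design_a))
print("Design B mean SUS:", mean(sus_score(r) for r in design_b))
```

With only a handful of participants the means are indicative at best; pair them with the accuracy and response-time measures before declaring a winner.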

### Quick pre-flight checklist

Before presenting any viz, verify:

- [ ] Chart type matches the data relationship (§2).
- [ ] The most important variable is encoded with position.
- [ ] Palette is colorblind-friendly and appropriate (qualitative / sequential / diverging).
- [ ] Bar-chart axes start at zero; on other chart types, any truncated or broken axis is justified and clearly marked.
- [ ] Title states the **finding**, not just the data.
- [ ] Direct labels replace legends where possible.
- [ ] No 3D, no truncated bar baselines, no unjustified dual axes.
- [ ] Uncertainty shown if the data has it.
- [ ] Works at small size (thumbnail legibility).
- [ ] Alt-text describes the insight, not the mechanics.

---

## 13. Working with AI-Generated Visualizations

When helping someone use an AI (including this skill) to build a chart, be honest
about what AI does well and what still needs human judgment.

**AI is good at:** generating boilerplate chart code, modifying existing code,
suggesting alternative chart types, producing titles or captions for simple viz,
writing code templates across libraries.

**Human judgment is still needed for:** identifying the right question from domain
context, choosing visual encodings that match the analytical task, aesthetic and
brand decisions, evaluating whether a chart is truthful and unbiased, ensuring
accessibility, and shaping the narrative. AI can confidently produce misleading or
aesthetically poor charts; understanding the principles above is how you *critically
evaluate, fix, and direct* AI-generated visuals. (Echoing Tukey: visualizations are
analytical instruments, not just outputs.)

---

## 14. Reference Files

Load these when you need more detail:

- `references/chart_types.md` — Deep guide for 25+ chart types with SVG patterns,
  use cases, pitfalls.
- `references/color_palettes.md` — Hex codes for ColorBrewer, HUSL, seaborn palettes;
  colorblind-safe combinations; library syntax.
- `references/libraries.md` — Quick-start syntax for ggplot2, seaborn, altair,
  matplotlib, Vega-Lite, D3.js, plus Bokeh/Plotly links.

**Load the relevant reference file when:**
- The user asks for a less common chart type (Sankey, ridgeline, waffle, cartogram).
- Color selection is central to the task.
- The user wants syntax for a specific library not in §11.

---

## 15. Theoretical Foundations (for attribution & deep dives)

This skill distills ideas from:

- **Bertin, J. (1967)** — *Semiology of Graphics*. Visual variables.
- **Tufte, E. (1983)** — *The Visual Display of Quantitative Information*. Lie Factor, data-ink ratio, "above all else show the data."
- **Cleveland & McGill (1984)** / **Mackinlay (1986)** — Perceptual channel rankings.
- **Wilkinson, L. (1999)** / **Wickham, H. (2010)** — Grammar of Graphics.
- **Shneiderman, B. (1996)** — "The Eyes Have It." Overview → zoom & filter → details.
- **Ware, C. (2020)** — *Information Visualization: Perception for Design*.
- **Wilke, C. O. (2019)** — *Fundamentals of Data Visualization*.
- **Franconeri, S. et al. (2021)** — *The Science of Visual Data Communication*.
- **Knaflic, C. N. (2015)** — *Storytelling with Data*.
- **Schwabish, J. (2021)** — *Better Data Visualizations*. Five guidelines.
- **Murray, S. (2017)** — *Interactive Data Visualization for the Web with D3*.
- **Preece, Sharp & Rogers (2019)** — *Interaction Design*. Usability and evaluation.
- **Bergstrom & West (2020)** — *Calling Bullshit*. Misleading practices.
- **Forsell & Johansson (2010)** — Heuristics for visualization evaluation.

For open datasets to practice with: Kaggle, Awesome Public Datasets (GitHub),
Makeover Monday, Opendata.swiss, OECD Data, World Bank Data360.