---
name: data-journalism
description: Data journalism workflows for analysis, visualization, and storytelling. Use when analyzing datasets, creating charts and maps, cleaning messy data, calculating statistics, or building data-driven stories. Essential for reporters, newsrooms, and researchers working with quantitative information.
---

# Data journalism methodology

Systematic approaches for finding, analyzing, and presenting data in journalism.

## Data acquisition

### Public data sources

```markdown
## Federal data sources

### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings

### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants

### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports
```

### Data request strategies

```markdown
## Getting data that isn't public

### FOIA for datasets
- Request databases, not just documents
- Ask for the data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs

### Building your own dataset
- Scraping public information (see the sketch after this list)
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)

### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases
```
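For the scraping route, here is a minimal sketch of pulling a public HTML table into a DataFrame. The URL and the page structure are hypothetical placeholders; real scrapes should also respect the site's robots.txt and terms of use.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical URL -- replace with the actual public page
URL = 'https://example.gov/inspections'

def scrape_table(url):
    """Fetch a page and parse its first HTML table into a DataFrame."""
    response = requests.get(url, headers={'User-Agent': 'newsroom-research'},
                            timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError('No table found on page')
    rows = []
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    time.sleep(1)  # be polite: pause between requests
    # First row is assumed to be the header
    return pd.DataFrame(rows[1:], columns=rows[0])
```

For well-formed pages, `pd.read_html(url)` often does the same job in a single call.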
## Data cleaning and preparation

### Common data problems

```python
import pandas as pd
import numpy as np

# Load messy data
df = pd.read_csv('raw_data.csv')

# 1. INCONSISTENT FORMATTING
# Problem: Names in different formats
# "SMITH, JOHN" vs "John Smith" vs "smith john"

def standardize_name(name):
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().lower()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.title()

df['name_clean'] = df['name'].apply(standardize_name)

# 2. DATE INCONSISTENCIES
# Problem: Dates in multiple formats
# "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"

def parse_date(date_str):
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None
    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    # Fall back to pandas' flexible parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None

df['date_clean'] = df['date'].apply(parse_date)

# 3. MISSING VALUES
# Strategy depends on context

# Check missing value patterns
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Options:
# - Drop rows with critical missing values
df_clean = df.dropna(subset=['required_field'])

# - Fill with appropriate values
df['category'] = df['category'].fillna('Unknown')
df['amount'] = df['amount'].fillna(df['amount'].median())

# - Flag as missing (preserve for analysis)
df['amount_missing'] = df['amount'].isna()

# 4. DUPLICATES
# Find and handle duplicates

# Exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Fuzzy duplicates (similar but not identical)
# Use record linkage or manual review
from fuzzywuzzy import fuzz  # thefuzz is the maintained fork

def find_similar_names(names, threshold=85):
    """Find potentially duplicate names."""
    duplicates = []
    for i, name1 in enumerate(names):
        for name2 in names[i+1:]:
            score = fuzz.ratio(str(name1).lower(), str(name2).lower())
            if score >= threshold:
                duplicates.append((name1, name2, score))
    return duplicates

# 5. OUTLIERS
# Identify potential data entry errors

def flag_outliers(series, method='iqr', threshold=1.5):
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold

df['amount_outlier'] = flag_outliers(df['amount'])
print(f"Outliers found: {df['amount_outlier'].sum()}")

# 6. DATA TYPE CORRECTIONS
# Ensure proper types for analysis

# Convert to numeric (handling errors)
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Convert to categorical (saves memory, enables ordering)
df['status'] = pd.Categorical(df['status'],
                              categories=['Pending', 'Active', 'Closed'],
                              ordered=True)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```

### Data validation checklist

```markdown
## Pre-analysis data validation

### Structural checks
- [ ] Row count matches expected
- [ ] Column count and names correct
- [ ] Data types appropriate
- [ ] No unexpected null columns

### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected

### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous

### Source verification
- [ ] Can trace back to original source
- [ ] Methodology documented
- [ ] Known limitations noted
- [ ] Update frequency understood
```
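A sketch of how several of these checks can be automated before analysis. The column names and expected values (`id`, `date`, `amount`, `status`) are placeholders carried over from the cleaning examples above, not a real schema.

```python
import pandas as pd

def validate(df):
    """Collect validation failures rather than stopping at the first."""
    problems = []

    # Structural: expected columns present
    expected = {'id', 'date', 'amount', 'status'}
    missing = expected - set(df.columns)
    if missing:
        problems.append(f"Missing columns: {sorted(missing)}")
        return problems  # later checks need these columns

    # Content: IDs unique where expected
    if df['id'].duplicated().any():
        problems.append(f"{df['id'].duplicated().sum()} duplicate IDs")

    # Content: date range makes sense (no future dates)
    dates = pd.to_datetime(df['date'], errors='coerce')
    if (dates > pd.Timestamp.now()).any():
        problems.append("Dates in the future")

    # Content: numeric values within expected bounds
    if (df['amount'] < 0).any():
        problems.append(f"{(df['amount'] < 0).sum()} negative amounts")

    # Content: categorical values match expected options
    unexpected = set(df['status'].dropna()) - {'Pending', 'Active', 'Closed'}
    if unexpected:
        problems.append(f"Unexpected status values: {unexpected}")

    return problems

for problem in validate(pd.read_csv('raw_data.csv')):
    print(f"WARNING: {problem}")
```

Logging warnings instead of raising on the first failure gives you the full list of problems in one pass, which is easier to take back to the data's source agency.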
## Statistical analysis for journalism

### Basic statistics with context

```python
# Essential statistics for any dataset

def describe_for_journalism(df, column):
    """Generate journalist-friendly statistics."""
    stats = {
        'count': len(df[column].dropna()),
        'missing': df[column].isna().sum(),
        'min': df[column].min(),
        'max': df[column].max(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
    }

    # Percentiles for context
    stats['25th_percentile'] = df[column].quantile(0.25)
    stats['75th_percentile'] = df[column].quantile(0.75)
    stats['90th_percentile'] = df[column].quantile(0.90)
    stats['99th_percentile'] = df[column].quantile(0.99)

    # Distribution shape
    stats['skewness'] = df[column].skew()

    return stats

# Example interpretation
stats = describe_for_journalism(df, 'salary')

print(f"""
SALARY ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is
{'right-skewed (pulled up by high earners)' if stats['skewness'] > 0 else 'left-skewed'}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
```

### Comparisons and context

```python
# Year-over-year change

def calculate_change(current, previous):
    """Calculate change with multiple metrics."""
    absolute = current - previous
    if previous != 0:
        percent = (current - previous) / previous * 100
    else:
        percent = float('inf') if current > 0 else 0
    return {
        'current': current,
        'previous': previous,
        'absolute_change': absolute,
        'percent_change': percent,
        'direction': 'increased' if absolute > 0 else 'decreased' if absolute < 0 else 'unchanged'
    }

# Per capita calculations (essential for fair comparisons)

def per_capita(value, population):
    """Calculate a rate per 100,000, the standard for comparisons."""
    return (value / population) * 100000

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has the higher crime rate despite fewer total crimes!

# Inflation adjustment

def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation using CPI values by year."""
    from_cpi = cpi_data[from_year]
    to_cpi = cpi_data[to_year]
    return amount * (to_cpi / from_cpi)

# Always adjust when comparing dollars across years!
```

### Correlation vs. causation

```markdown
## Reporting correlations responsibly

### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"

### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"

### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled?
6. Are there alternative explanations?

### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) between unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance
```
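A small sketch of computing a correlation and phrasing it within the bounds above. The dataset and column names (`median_income`, `test_score`) are hypothetical, purely for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical district-level data
df = pd.DataFrame({
    'median_income': [42000, 55000, 61000, 48000, 75000, 39000],
    'test_score':    [71, 78, 82, 74, 88, 69],
})

r, p_value = stats.pearsonr(df['median_income'], df['test_score'])
print(f"r = {r:.2f}, p = {p_value:.3f}")

# Safe phrasing: "Districts with higher median incomes also tend to
# have higher test scores."
# Not: "Higher incomes cause higher test scores."

# Always plot the data too: a single coefficient can be driven by a
# few outliers or mask a nonlinear relationship.
```

Report the p-value and sample size alongside r; a strong-looking correlation in six data points carries far less weight than a modest one in six thousand.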
## Data visualization

### Chart selection guide

```markdown
## Choosing the right chart

### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs target

### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points

### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape

### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in relationship over time

### Composition
- **Pie chart**: Parts of a whole (use sparingly, max 5 slices)
- **Stacked bar**: Parts of a whole across categories
- **Treemap**: Hierarchical composition

### Geographic
- **Choropleth**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol**: Magnitude at locations
```

### Visualization best practices

```python
import matplotlib.pyplot as plt

# Journalist-friendly chart defaults
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def create_bar_chart(data, title, source, xlabel='', ylabel=''):
    """Create a publication-ready bar chart."""
    fig, ax = plt.subplots()

    # Create bars
    bars = ax.bar(list(data.keys()), list(data.values()), color='#2c7bb6')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:,.0f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    ha='center', va='bottom', fontsize=10)

    # Labels and title
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add source annotation
    fig.text(0.99, 0.01, f'Source: {source}',
             ha='right', va='bottom', fontsize=9, color='gray')

    plt.tight_layout()
    return fig

# Example
data = {'2020': 1200, '2021': 1450, '2022': 1380, '2023': 1620}
fig = create_bar_chart(data,
                       'Annual Widget Production',
                       'Department of Widgets, 2024',
                       ylabel='Units produced')
fig.savefig('chart.png', dpi=150, bbox_inches='tight')
```

### Avoiding misleading visualizations

```markdown
## Chart integrity checklist

### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes labeled with units

### Data representation
- [ ] All data points visible
- [ ] Colors are distinguishable (including for colorblind readers)
- [ ] Proportions are accurate
- [ ] 3D effects not distorting perception

### Context
- [ ] Title describes what's shown, not a conclusion
- [ ] Time period clearly stated
- [ ] Source cited
- [ ] Sample size/methodology noted if relevant
- [ ] Uncertainty shown where appropriate

### Honesty
- [ ] Cherry-picking dates avoided
- [ ] Outliers explained, not hidden
- [ ] Dual axes justified (usually avoid)
- [ ] Annotations don't mislead
```
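To make the first axis item concrete, a short sketch contrasting a truncated y-axis with a zero baseline, reusing the widget data from the example above.

```python
import matplotlib.pyplot as plt

years = ['2020', '2021', '2022', '2023']
values = [1200, 1450, 1380, 1620]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Misleading: truncating the axis makes a ~35% rise look explosive
ax1.bar(years, values, color='#d7191c')
ax1.set_ylim(1100, 1700)
ax1.set_title('Truncated axis (misleading)')

# Honest: a zero baseline keeps bar heights proportional to the values
ax2.bar(years, values, color='#2c7bb6')
ax2.set_ylim(0, 1700)
ax2.set_title('Zero baseline (honest)')

plt.tight_layout()
plt.savefig('axis_comparison.png', dpi=150)
```

The rule applies to bar charts because their encoding is length; line charts, which encode position, can justifiably use a non-zero baseline when the range is stated.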
## Story structure for data journalism

### Data story framework

```markdown
## The data story arc

### 1. The hook (nut graf)
- What's the key finding?
- Why should readers care?
- What's the human impact?

### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations

### 3. The context
- How does this compare to the past?
- How does this compare to elsewhere?
- What's the trend?

### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices

### 5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?

### 6. The methodology box
- Where did the data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
```

### Methodology documentation template

```markdown
## How we did this analysis

### Data sources
[List all data sources with links and access dates]

### Time period
[Specify exactly what time period is covered]

### Definitions
[Define key terms and how you operationalized them]

### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]

### Limitations
- [Limitation 1]
- [Limitation 2]

### What we excluded and why
- [Excluded category]: [Reason]

### Verification
[How findings were verified/checked]

### Code and data availability
[Link to GitHub repo if sharing code/data]

### Contact
[How readers can reach you with questions]
```

## Tools and resources

### Essential tools

| Tool | Purpose | Cost |
|------|---------|------|
| Python + pandas | Data analysis | Free |
| R + tidyverse | Statistical analysis | Free |
| Excel/Sheets | Quick analysis | Free/Low |
| Datawrapper | Charts for web | Free tier |
| Flourish | Interactive viz | Free tier |
| QGIS | Mapping | Free |
| Tabula | PDF table extraction | Free |
| OpenRefine | Data cleaning | Free |

### Learning resources
- NICAR (the data journalism program of Investigative Reporters & Editors)
- Knight Center for Journalism in the Americas
- Data Journalism Handbook (datajournalism.com)
- FlowingData (flowingdata.com)
- The Pudding (pudding.cool) - examples of data-driven storytelling