---
name: assessment-design-guide
description: "Psychometrics and educational assessment design for researchers"
metadata:
  openclaw:
    emoji: "📋"
    category: "domains"
    subcategory: "education"
    keywords: ["psychometrics", "assessment", "item-response-theory", "test-design", "validity", "reliability"]
    source: "wentor"
---

# Assessment Design Guide

A skill for designing, validating, and analyzing educational assessments using modern psychometric methods. Covers classical test theory, item response theory, test construction, validity evidence, and computerized adaptive testing.

## Classical Test Theory

### Reliability Analysis

Classical test theory (CTT) models observed scores as the sum of a true score and error:

```
X = T + E
```

Key reliability coefficients:

| Coefficient | Method | Interpretation |
|-------------|--------|----------------|
| Cronbach's alpha | Internal consistency | Homogeneity of items |
| Test-retest | Stability over time | Temporal consistency |
| Parallel forms | Equivalent test versions | Form equivalence |
| Split-half (Spearman-Brown) | Odd-even item split | Internal consistency |
| Inter-rater (Cohen's kappa) | Multiple raters | Scoring agreement |
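
Cronbach's alpha is the most commonly reported of these and follows directly from its definition, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). A minimal sketch (the function name is illustrative; the `responses` layout matches the `item_analysis` helper below):

```python
import pandas as pd

def cronbach_alpha(responses: pd.DataFrame) -> float:
    """
    Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total-score variance).

    responses: examinees x items DataFrame (binary or Likert), items as columns.
    """
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1).sum()
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```

Conventional rules of thumb treat alpha of at least 0.70 as acceptable for research use and expect 0.90 or higher for high-stakes individual decisions. The item-level statistics below complement these test-level coefficients.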
```python
import numpy as np
import pandas as pd

def item_analysis(responses: pd.DataFrame, total_scores: pd.Series) -> pd.DataFrame:
    """
    Classical item analysis: difficulty, discrimination, point-biserial.

    responses: binary DataFrame (1=correct, 0=incorrect), items as columns.
    total_scores: total test score for each examinee.
    """
    results = []
    for item in responses.columns:
        scores = responses[item]
        difficulty = scores.mean()  # p-value (proportion correct)

        # Point-biserial correlation (uncorrected: the total includes this item)
        corr = scores.corr(total_scores)

        # Upper-lower discrimination (top/bottom 27%)
        cutoff_high = total_scores.quantile(0.73)
        cutoff_low = total_scores.quantile(0.27)
        upper = scores[total_scores >= cutoff_high].mean()
        lower = scores[total_scores <= cutoff_low].mean()
        discrimination = upper - lower

        results.append({
            "item": item,
            "difficulty": round(difficulty, 3),
            "discrimination": round(discrimination, 3),
            "point_biserial": round(corr, 3),
            "flag": "review" if difficulty < 0.2 or difficulty > 0.9
                    or discrimination < 0.2 else "ok"
        })
    return pd.DataFrame(results)
```

### Item Selection Guidelines

- **Difficulty**: Aim for p-values between 0.30 and 0.80 for maximum discrimination
- **Discrimination**: Items with D < 0.20 should be revised or removed
- **Distractors**: Each distractor should attract at least 5% of examinees
- **Point-biserial**: Should be positive and ideally above 0.25

## Item Response Theory

### The Three-Parameter Logistic Model

IRT provides a stronger framework than CTT: it models the probability of a correct response as a function of examinee ability and item parameters, and, to the extent the model fits, those parameters do not depend on the particular examinee sample the way CTT statistics do.

```python
import numpy as np

def irt_3pl(theta: float, a: float, b: float, c: float) -> float:
    """
    Three-parameter logistic IRT model.

    theta: examinee ability (typically -3 to +3)
    a: discrimination parameter (slope, typically 0.5 to 2.5)
    b: difficulty parameter (location, same scale as theta)
    c: guessing parameter (lower asymptote, typically 0.0 to 0.35)

    Returns: probability of correct response
    """
    exponent = -a * (theta - b)
    return c + (1 - c) / (1 + np.exp(exponent))

# Item characteristic curves for three items
thetas = np.linspace(-3, 3, 100)
item_easy = [irt_3pl(t, a=1.0, b=-1.0, c=0.2) for t in thetas]
item_medium = [irt_3pl(t, a=1.5, b=0.0, c=0.2) for t in thetas]
item_hard = [irt_3pl(t, a=1.2, b=1.5, c=0.2) for t in thetas]
```

### IRT Model Estimation

```python
# Using the 'mirt' package in R (called via rpy2 or standalone)
# R code for fitting a 2PL model:
r_code = """
library(mirt)

# responses: binary matrix (examinees x items)
mod <- mirt(responses, model = 1, itemtype = "2PL")

# Item parameters
coef(mod, simplify = TRUE)

# Ability estimates (Expected A Posteriori)
theta_hat <- fscores(mod, method = "EAP")

# Model fit
M2(mod)  # limited-information fit statistic
itemfit(mod, fit_stats = "S_X2")
"""
```

### Model Comparison

| Model | Parameters | Use Case |
|-------|-----------|----------|
| Rasch (1PL) | b only | Equal discrimination assumed; measurement-focused |
| 2PL | a, b | Different discrimination; general purpose |
| 3PL | a, b, c | Multiple choice with guessing |
| Graded Response | a, b_k | Likert-scale or partial-credit items |
| Nominal Response | a_k, c_k | Multiple choice with informative distractors |

## Validity Evidence

### The Unified Validity Framework

Following the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), validity is a unitary concept supported by five sources of evidence:

1. **Content evidence**: Expert review confirms items represent the construct domain
2. **Response process evidence**: Think-aloud protocols confirm examinees engage the intended cognitive processes
3. **Internal structure evidence**: Factor analysis confirms dimensionality matches the test blueprint
4. **Relations to other variables**: Correlations with external criteria (convergent, discriminant, predictive)
5. **Consequences evidence**: Test use leads to intended benefits without unintended harm

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Exploratory factor analysis to check dimensionality against the blueprint
# (a strict confirmatory test would require an SEM/CFA package instead)
# item_responses: examinees x items DataFrame; item_names: list of item labels
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(item_responses)

# Eigenvalues for scree plot
eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues[:10])

# Factor loadings
loadings = pd.DataFrame(
    fa.loadings_,
    columns=["Factor1", "Factor2", "Factor3"],
    index=item_names,
)
print(loadings.round(3))
```

## Computerized Adaptive Testing

### CAT Algorithm

Computerized adaptive testing selects items in real time to match examinee ability:

```
Initialize: theta_0 = 0 (prior mean)
For each item i = 1, 2, ..., until stopping rule met:
    1. Select item with maximum Fisher information at current theta
    2. Administer item, observe response
    3. Update theta estimate using maximum likelihood or Bayesian EAP
    4. Check stopping rule:
       - Fixed length (e.g., 30 items)
       - SE(theta) < threshold (e.g., 0.30)
       - Maximum time reached
Return: final theta estimate and standard error
```
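
A minimal end-to-end sketch of this loop for a 2PL bank, where Fisher information reduces to a^2 * P * (1 - P), using EAP updating over a quadrature grid. The item bank, the simulated examinee, and all names here are illustrative assumptions, not part of any standard API:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_update(resp, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP theta estimate and posterior SD under a standard normal prior."""
    posterior = np.exp(-0.5 * grid**2)        # prior, up to a constant
    for u, ai, bi in zip(resp, a, b):
        p = p_2pl(grid, ai, bi)
        posterior *= p**u * (1 - p)**(1 - u)  # item response likelihoods
    posterior /= posterior.sum()
    theta = (grid * posterior).sum()
    se = np.sqrt(((grid - theta)**2 * posterior).sum())
    return theta, se

def next_item(theta, a_bank, b_bank, used):
    """Unused item with maximum Fisher information a^2 * P * (1-P) at theta."""
    p = p_2pl(theta, a_bank, b_bank)
    info = a_bank**2 * p * (1 - p)
    info[list(used)] = -np.inf                # exclude administered items
    return int(np.argmax(info))

# Simulated administration against a hypothetical 200-item bank
rng = np.random.default_rng(42)
a_bank = rng.uniform(0.8, 2.0, 200)
b_bank = rng.normal(0.0, 1.0, 200)
true_theta = 1.2                              # the simulated examinee

theta, se, used = 0.0, np.inf, set()
resp, a_adm, b_adm = [], [], []
while se > 0.30 and len(used) < 30:           # SE and max-length stopping rules
    j = next_item(theta, a_bank, b_bank, used)
    used.add(j)
    resp.append(int(rng.random() < p_2pl(true_theta, a_bank[j], b_bank[j])))
    a_adm.append(a_bank[j])
    b_adm.append(b_bank[j])
    theta, se = eap_update(resp, a_adm, b_adm)

print(f"theta_hat = {theta:.2f}, SE = {se:.2f}, items = {len(used)}")
```

Note that pure maximum-information selection overexposes the highest-discrimination items, which is exactly the problem the exposure-control methods below address.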
### Item Exposure Control

To prevent overuse of high-quality items and maintain test security:

- **Sympson-Hetter method**: Set maximum exposure rates per item (e.g., 0.25)
- **a-stratified method**: Divide the item bank into strata by discrimination and sample within strata
- **Shadow-test approach**: Assemble a full shadow test at each step and administer the optimal item from it

## Tools and Software

- **R mirt package**: Full-featured IRT estimation, DIF analysis, CAT simulation
- **py-irt (Python library)**: Bayesian IRT models using PyTorch
- **jMetrik**: Open-source Java application for classical and IRT analysis
- **TAO (Testing Assisté par Ordinateur)**: Open-source assessment delivery platform
- **Concerto**: Open-source adaptive testing platform from Cambridge

## Key References

- Embretson, S. E., & Reise, S. P. (2000). *Item Response Theory for Psychologists*. Lawrence Erlbaum.
- de Ayala, R. J. (2022). *The Theory and Practice of Item Response Theory* (2nd ed.). Guilford Press.
- AERA, APA, & NCME (2014). *Standards for Educational and Psychological Testing*.