---
name: ssr
description: Semantic Similarity Rating - elicit realistic Likert-scale responses from LLMs using textual elicitation and embedding similarity mapping. Use when you need survey-like responses, purchase intent ratings, relevance scores, or any Likert-scale measurement that should match human response distributions.
---

# Semantic Similarity Rating (SSR)

SSR is a method for eliciting realistic Likert-scale responses from LLMs. Instead of asking for direct numerical ratings (which produce unrealistic, narrow distributions), SSR:

1. Elicits free-text responses about the subject
2. Maps those responses to Likert scale distributions using embedding similarity

This achieves ~90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85).

## When to Use SSR

- Consumer research / purchase intent surveys
- Product concept evaluation
- Relevance or satisfaction ratings
- Any Likert-scale measurement where you need realistic distributions
- When you need qualitative feedback alongside quantitative scores

## The Problem with Direct Likert Rating

When LLMs are asked directly for Likert ratings (1-5), they:
- Regress to "safe" middle values (mostly 3s)
- Produce unrealistically narrow distributions
- Rarely use extreme values (1 or 5)
- Lose the nuance of their actual assessment

## SSR Method

### Step 1: Create Synthetic Consumer Persona

Prompt the LLM to impersonate a consumer with specific demographic attributes:

```
You are participating in a consumer research survey. You are a [age]-year-old [gender] living in [region] with [income level description].

You will be shown a product concept and asked about your purchase intent. Respond naturally and briefly as this person would.
```

**Key demographics to include:**
- Age (influences purchase intent significantly)
- Gender
- Income level (strongly influences purchase intent)
- Region/location
- Ethnicity (optional, less consistent influence)
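The template above can be filled in programmatically. The following is a minimal sketch: the `Persona` dataclass and `build_persona_prompt` helper are hypothetical names introduced here for illustration, and the free-text income description is an assumption about how you phrase income.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Persona:
    """Hypothetical container for the demographic attributes listed above."""
    age: int
    gender: str
    region: str
    income: str                      # free-text income description
    ethnicity: Optional[str] = None  # optional attribute


def build_persona_prompt(p: Persona) -> str:
    """Fill the Step 1 template with one persona's attributes."""
    # "an 8-", "an 11-", "an 18-", "an 80-year-old"; otherwise "a".
    article = "an" if str(p.age).startswith("8") or p.age in (11, 18) else "a"
    who = f"{article} {p.age}-year-old {p.gender}"
    if p.ethnicity:
        who += f" ({p.ethnicity})"
    return (
        "You are participating in a consumer research survey. "
        f"You are {who} living in {p.region} with {p.income}.\n\n"
        "You will be shown a product concept and asked about your purchase "
        "intent. Respond naturally and briefly as this person would."
    )
```

For example, `build_persona_prompt(Persona(34, "woman", "the Midwest", "a middle income"))` yields a system prompt beginning "You are participating in a consumer research survey. You are a 34-year-old woman living in the Midwest with a middle income."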

### Step 2: Present Stimulus and Elicit Free-Text Response

Show the product concept (image or text) and ask:

```
How likely would you be to purchase this product?

Reply briefly to any questions posed to you.
```

**Do NOT constrain the response to a number.** Let the LLM respond naturally, e.g.:
- "I'm somewhat interested. If it works well and isn't too expensive, I might give it a try."
- "Seems kinda bougie for this kind of product. I'll stick with what I know."
- "The ease of use and safety are appealing, but I'd want to know more about effectiveness."

### Step 3: Map Response to Likert Distribution via Embedding Similarity

#### Reference Statement Sets

Create anchor statements for each Likert value. Use multiple sets (recommended: 6) and average results:

**Set 1 - Direct likelihood:**
```
1: "It's very unlikely I'd buy it."
2: "It's rather unlikely I'd buy it."
3: "I'm not sure if I'd buy it."
4: "It's rather likely I'd buy it."
5: "It's very likely I'd buy it."
```

**Set 2 - Intent phrasing:**
```
1: "I definitely would not purchase this."
2: "I probably would not purchase this."
3: "I might or might not purchase this."
4: "I probably would purchase this."
5: "I definitely would purchase this."
```

**Set 3 - Interest-based:**
```
1: "I have no interest in buying this."
2: "I have little interest in buying this."
3: "I have some interest in buying this."
4: "I have considerable interest in buying this."
5: "I have strong interest in buying this."
```

**Set 4 - Casual phrasing:**
```
1: "No way I'd buy this."
2: "Probably wouldn't buy this."
3: "Maybe I'd buy this, maybe not."
4: "Yeah, I'd probably buy this."
5: "For sure I'd buy this."
```

**Set 5 - Conditional phrasing:**
```
1: "I wouldn't buy this under any circumstances."
2: "I'd need a lot of convincing to buy this."
3: "I could see myself buying this in the right situation."
4: "I'd likely buy this if I saw it in stores."
5: "I'd definitely buy this as soon as it's available."
```

**Set 6 - Recommendation framing:**
```
1: "I would actively avoid this product."
2: "I wouldn't recommend this product."
3: "This product seems okay."
4: "I would consider recommending this product."
5: "I would enthusiastically recommend this product."
```
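One way to hold these anchors in code is a mapping from set name to an ordered list of five statements, where list position `i` corresponds to Likert value `i + 1`. The sketch below copies Sets 1 and 2 verbatim; the dictionary keys and the `validate_reference_sets` helper are hypothetical names, and Sets 3-6 would be added in the same shape.

```python
from typing import Dict, List

# Index i in each list is the anchor for Likert value i + 1.
REFERENCE_SETS: Dict[str, List[str]] = {
    "direct_likelihood": [
        "It's very unlikely I'd buy it.",
        "It's rather unlikely I'd buy it.",
        "I'm not sure if I'd buy it.",
        "It's rather likely I'd buy it.",
        "It's very likely I'd buy it.",
    ],
    "intent": [
        "I definitely would not purchase this.",
        "I probably would not purchase this.",
        "I might or might not purchase this.",
        "I probably would purchase this.",
        "I definitely would purchase this.",
    ],
    # Sets 3-6 follow the same pattern.
}


def validate_reference_sets(sets: Dict[str, List[str]], n_points: int = 5) -> None:
    """Raise ValueError if any set lacks exactly n_points distinct anchors."""
    for name, statements in sets.items():
        if len(statements) != n_points:
            raise ValueError(f"{name}: expected {n_points} anchors, got {len(statements)}")
        if len(set(statements)) != n_points:
            raise ValueError(f"{name}: anchors must be distinct")


validate_reference_sets(REFERENCE_SETS)
```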

#### Compute Similarity Scores

1. Get embedding vectors for:
   - The synthetic response: `v_response`
   - Each reference statement: `v_ref[1..5]`

2. Compute cosine similarity for each reference:
   ```
   similarity[r] = (v_response · v_ref[r]) / (|v_response| × |v_ref[r]|)
   ```

3. Convert to probability distribution:
   ```
   # Subtract minimum to create contrast
   min_sim = min(similarity[1..5])
   adjusted[r] = similarity[r] - min_sim

   # Normalize to probability distribution
   p[r] = adjusted[r] / sum(adjusted[1..5])
   ```

4. Average across all reference sets for final distribution
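The four steps above can be sketched with NumPy as follows. The function names are hypothetical, and the uniform fallback for the case where every anchor is equally similar (so the adjusted values sum to zero) is an assumption added here to avoid division by zero; it is not part of the method as described above.

```python
import numpy as np


def cosine_similarities(v_response, v_refs):
    """Cosine similarity between one response vector and each reference row."""
    v = np.asarray(v_response, dtype=float)
    refs = np.asarray(v_refs, dtype=float)
    return refs @ v / (np.linalg.norm(refs, axis=1) * np.linalg.norm(v))


def similarities_to_pmf(similarities):
    """Subtract the minimum similarity, then normalize so the values sum to 1."""
    sims = np.asarray(similarities, dtype=float)
    adjusted = sims - sims.min()
    total = adjusted.sum()
    if total == 0.0:
        # Assumption: fall back to a uniform distribution when all
        # similarities are identical.
        return np.full(sims.shape, 1.0 / sims.size)
    return adjusted / total


def ssr_pmf(v_response, ref_sets):
    """Average the per-set distributions over all reference sets."""
    pmfs = [similarities_to_pmf(cosine_similarities(v_response, refs))
            for refs in ref_sets]
    return np.mean(pmfs, axis=0)
```

As a worked example, similarities of `[0.20, 0.25, 0.30, 0.40, 0.35]` become `[0, 0.05, 0.10, 0.20, 0.15]` after subtracting the minimum, which normalizes to approximately `[0, 0.1, 0.2, 0.4, 0.3]`. Note that within a single reference set the least similar anchor always receives probability 0; averaging over several sets softens this.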

#### Embedding Model

Use OpenAI's `text-embedding-3-small` (or `text-embedding-3-large` for marginal improvement).
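A minimal sketch of fetching embeddings with the official `openai` Python SDK (v1-style client) might look like this. The `embed_texts` helper is a hypothetical name; it takes the client as an argument so that it can be exercised without network access, and it assumes the response exposes `data[i].embedding` and `data[i].index` as in the v1 SDK.

```python
from typing import Sequence

import numpy as np


def embed_texts(client, texts: Sequence[str],
                model: str = "text-embedding-3-small") -> np.ndarray:
    """Return one embedding row per input text, in input order."""
    response = client.embeddings.create(model=model, input=list(texts))
    # Sort by index defensively so rows line up with the input order.
    rows = sorted(response.data, key=lambda item: item.index)
    return np.array([item.embedding for item in rows], dtype=float)


# Usage (requires the `openai` package and an OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   vectors = embed_texts(client, ["I'd probably buy this."])
```

Because the reference statements never change, their embeddings can be computed once and cached rather than re-embedded for every response.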

### Step 4: Aggregate Results

For a synthetic survey panel:
- Generate multiple synthetic consumers with varied demographics
- Collect response distributions from each
- Aggregate into survey-level distributions
- Calculate mean purchase intent: `PI = sum(r × p[r])` for r in 1..5
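The aggregation can be sketched as below. The function names are hypothetical, and weighting every synthetic consumer equally is an assumption; a panel built to match a target population might instead need weights.

```python
import numpy as np


def mean_purchase_intent(pmf) -> float:
    """PI = sum(r * p[r]) for r in 1..len(pmf)."""
    pmf = np.asarray(pmf, dtype=float)
    ratings = np.arange(1, pmf.size + 1)
    return float(ratings @ pmf)


def aggregate_panel(pmfs) -> np.ndarray:
    """Unweighted average of per-consumer distributions (one row each)."""
    return np.mean(np.asarray(pmfs, dtype=float), axis=0)


panel = [
    [0.0, 0.1, 0.2, 0.4, 0.3],  # consumer leaning towards buying
    [0.5, 0.3, 0.2, 0.0, 0.0],  # consumer leaning against buying
]
survey = aggregate_panel(panel)      # approx. [0.25, 0.2, 0.2, 0.2, 0.15]
pi = mean_purchase_intent(survey)    # approx. 2.8
```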

## Implementation Notes

### Temperature Settings

- LLM temperature: 0.5 works well (0.5-1.5 range tested)
- Generate 2 samples per consumer and average for stability

### Demographics Matter

Without demographics, LLMs:
- Achieve high distributional similarity (~0.91 KS)
- Show poor correlation attainment (~50%)
- Rate everything positively without discriminating between concepts

With demographics, LLMs:
- Reach better correlation attainment (~90%)
- Differentiate properly between product concepts
- Respond most strongly to age and income, which have the strongest influence on response patterns

### Image vs Text Stimulus

- Image stimulus (product concept slides) performs slightly better
- Text-only descriptions work but with mild performance reduction
- For text stimulus, transcribe key information from product concepts

## Success Metrics

### Distributional Similarity (KS Similarity)

```
KS_similarity = 1 - max|F_real(r) - F_synthetic(r)|
```

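Here `F_real` and `F_synthetic` are the cumulative distribution functions of the real and synthetic ratings, and the maximum is taken over r = 1..5.

Target: > 0.85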
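As a sketch, the KS similarity of two rating distributions can be computed from their probability vectors. The `ks_similarity` name is hypothetical, and both inputs are assumed to be probability vectors over the same rating scale.

```python
import numpy as np


def ks_similarity(pmf_real, pmf_synthetic) -> float:
    """1 minus the largest gap between the two cumulative distributions."""
    f_real = np.cumsum(np.asarray(pmf_real, dtype=float))
    f_synthetic = np.cumsum(np.asarray(pmf_synthetic, dtype=float))
    return float(1.0 - np.max(np.abs(f_real - f_synthetic)))


real = [0.10, 0.20, 0.30, 0.30, 0.10]
synthetic = [0.05, 0.15, 0.40, 0.30, 0.10]
score = ks_similarity(real, synthetic)  # approx. 0.9
```

In this example the cumulative distributions differ most at r = 2 (0.30 versus 0.20), so the similarity is about 0.9, which clears the 0.85 target.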

### Correlation Attainment

Compare synthetic-real correlation to human test-retest reliability:

```
ρ = E[R_xy] / E[R_xx]
```

Where:
- R_xy = correlation between synthetic and real mean purchase intents
- R_xx = correlation between split-half human samples (theoretical maximum)

Target: > 90%
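The ratio can be estimated by repeated random splitting of the human data. The sketch below makes several assumptions that the text above does not specify: Pearson correlation across product concepts, random equal-sized halves per concept, and comparing the synthetic means against one human half in each split. The `correlation_attainment` name and its parameters are hypothetical.

```python
import numpy as np


def correlation_attainment(synthetic_means, human_ratings,
                           n_splits: int = 200, seed: int = 0) -> float:
    """Estimate E[R_xy] / E[R_xx] over random split-halves.

    synthetic_means: one synthetic mean purchase intent per concept.
    human_ratings:   one sequence of human ratings (1-5) per concept,
                     each with at least two ratings.
    """
    rng = np.random.default_rng(seed)
    synthetic_means = np.asarray(synthetic_means, dtype=float)
    r_xy, r_xx = [], []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for ratings in human_ratings:
            shuffled = rng.permutation(np.asarray(ratings, dtype=float))
            mid = shuffled.size // 2
            half_a.append(shuffled[:mid].mean())
            half_b.append(shuffled[mid:].mean())
        # Human test-retest: correlation between the two human halves.
        r_xx.append(np.corrcoef(half_a, half_b)[0, 1])
        # Synthetic vs. real: assumption, compare against one human half.
        r_xy.append(np.corrcoef(synthetic_means, half_a)[0, 1])
    return float(np.mean(r_xy) / np.mean(r_xx))
```

A value near 1.0 means the synthetic panel ranks concepts about as consistently as a second human sample would.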

## Alternative: Follow-up Likert Rating (FLR)

A simpler alternative that performs reasonably well:

1. Elicit free-text response (same as SSR)
2. Prompt a second LLM instance as a "Likert rating expert"
3. Have it map the text response to a single integer 1-5

FLR achieves:
- ~85% correlation attainment
- ~0.72 KS similarity (worse than SSR's 0.88)

Use SSR when distribution realism matters; use FLR when you only need a ranking of concepts.
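A minimal sketch of the FLR rating step follows. The prompt wording is an assumption (the exact "Likert rating expert" prompt is not given here), and `build_flr_prompt` and `parse_flr_rating` are hypothetical helper names. The call to the second LLM instance is left to the caller.

```python
import re
from typing import Optional


def build_flr_prompt(free_text: str) -> str:
    """Prompt for the second LLM instance that maps text to an integer."""
    return (
        "You are a Likert rating expert. Read the survey response below and "
        "map it to a purchase-intent rating from 1 (very unlikely to buy) to "
        "5 (very likely to buy). Answer with a single integer.\n\n"
        f'Response: "{free_text}"'
    )


def parse_flr_rating(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the rater's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning `None` for unparseable replies lets the caller decide whether to retry or drop the sample instead of silently defaulting to a middle value.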

## Qualitative Benefits

SSR's textual responses provide rich qualitative feedback:

**Positive feedback example:**
- "The ease of use and the promise of no sensitivity are appealing. Plus, it's from a trusted brand."

**Critical feedback examples:**
- "It seems a bit too high-end for my needs and budget."
- "Sounds expensive, and I'm not sure I buy all that 'microbiome' talk."

This qualitative data can inform product development beyond just ratings.

## Limitations

1. **Reference set sensitivity**: Different anchor sets produce slightly different mappings. Average across multiple sets.

2. **Domain dependency**: Works best for domains well-represented in LLM training data (consumer products, general topics). May hallucinate for obscure domains.

3. **Demographic fidelity**: Age and income patterns replicate well. Gender and region patterns are less consistent.

4. **Not a replacement**: SSR augments human research; it shouldn't fully replace human panels for final decisions.

## Quick Reference

| Method | Correlation Attainment | KS Similarity |
|--------|----------------------|---------------|
| Direct Likert Rating | ~80% | 0.26-0.39 |
| Follow-up Likert Rating | ~85% | 0.59-0.72 |
| **SSR** | **~90%** | **0.80-0.88** |
| Human test-retest | 100% (by definition) | 1.0 |

## Example Workflow

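```python
# Sketch of an SSR implementation.
# `llm`, `embed`, and `create_persona_prompt` are placeholders supplied by the
# caller: `llm.generate(system=..., user=..., temperature=...)` returns text,
# `embed(text)` returns a vector, and `reference_sets` is a list of
# five-statement anchor lists (Likert 1..5, in order).
import numpy as np


def ssr_rating(product_concept, demographics, *, llm, embed,
               reference_sets, create_persona_prompt):
    # Step 1: Create persona prompt
    persona = create_persona_prompt(demographics)

    # Step 2: Elicit free-text response
    response = llm.generate(
        system=persona,
        user=f"[Product concept: {product_concept}]\n\n"
             "How likely would you be to purchase this product?",
        temperature=0.5,
    )

    # Step 3: Embed the response and map it to one distribution per reference set
    response_embedding = np.asarray(embed(response), dtype=float)
    distributions = []
    for ref_set in reference_sets:
        ref_embeddings = np.asarray([embed(stmt) for stmt in ref_set], dtype=float)
        # Cosine similarity between the response and each anchor statement
        similarities = ref_embeddings @ response_embedding / (
            np.linalg.norm(ref_embeddings, axis=1)
            * np.linalg.norm(response_embedding)
        )

        # Subtract the minimum, then normalize to a probability distribution
        adjusted = similarities - similarities.min()
        total = adjusted.sum()
        if total == 0:
            # All anchors equally similar: fall back to uniform (avoids 0/0)
            distribution = np.full(len(ref_set), 1.0 / len(ref_set))
        else:
            distribution = adjusted / total
        distributions.append(distribution)

    # Average across reference sets
    final_distribution = np.mean(distributions, axis=0)

    # Step 4: Summarize as a distribution plus mean purchase intent
    ratings = np.arange(1, len(final_distribution) + 1)
    return {
        'distribution': final_distribution.tolist(),
        'mean_pi': float(ratings @ final_distribution),
        'qualitative_response': response,
    }
```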

## References

Maier, B.F., et al. (2025). "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings." arXiv:2510.08338v2

GitHub implementation: https://github.com/pymc-labs/semantic-similarity-rating