# Join Strategy: From Hash-Join to Leapfrog Triejoin This document explains how GQLite evaluates queries that involve multiple patterns joined together, why the naive approach doesn't scale, and how Leapfrog Triejoin (LTJ) solves the problem. It is based on the paper CompactLTJ (Arroyuelo et al., VLDBJ 2025), with a reference C++ implementation in `../cltj/`. ## 1. What is a join in GQL? A join happens when a query mentions the same variable in two or more patterns. The runtime must find values for that variable that satisfy all patterns simultaneously. Consider this graph (a social network where people follow each other): ``` Alice ──follows──► Bob ──follows──► Carol ▲ │ └────────── follows ────────────────┘ ``` The query "find triangles" asks for three people where each follows the next: ``` (a) -[:follows]-> (b), (b) -[:follows]-> (c), (c) -[:follows]-> (a) ``` Here, variable `b` appears in the first and second pattern, and `a` appears in the first and third. The runtime must find values of `a`, `b`, `c` that satisfy all three constraints at once. That's a multi-way join. ## 2. The naive approach: pairwise hash-join The simplest strategy is to evaluate two patterns at a time: ``` Step 1: Find all (a,b) pairs where a follows b → produces a big table of pairs Step 2: Find all (b,c) pairs where b follows c → produces another big table Step 3: Join the two tables on shared variable b → for each (a,b) in step 1, find matching (b,c) in step 2 → produces triples (a,b,c) Step 4: Filter: keep only triples where c follows a ``` This works, but has a fatal flaw: **intermediate blowup**. ### The blowup problem, illustrated Suppose each person follows 100 others (degree 100). Let's count: ``` Step 1: (a,b) pairs = 100 per person × N persons = 100N rows Step 2: (b,c) pairs = 100 per person × N persons = 100N rows Step 3: Join on b = for each (a,b), match with all (b,c) = up to 100 × 100 = 10,000 triples per value of b = 10,000N rows in the intermediate result Step 4: Filter c→a: most of these 10,000N rows get thrown away ``` We built a table with 10,000N rows only to discard most of it. For a 4-clique (6 sub-queries), the intermediate tables grow to millions or billions of rows. In our benchmarks, 4-clique simply hangs. ### Benchmark results with pairwise hash-join (before LTJ) | Query shape | Time (100K edges, limit=1000) | |-------------|-------------------------------| | 3-clique | 0.57s | | 4-cycle | 4.15s | | 3-path | 6.33s | | 2-tree | 101.8s (timeout) | | 4-path | 159.8s (timeout) | | 4-clique | never finishes | ## 3. Leapfrog Triejoin: the idea LTJ avoids intermediate blowup by binding variables **one at a time** and checking all constraints immediately. Instead of "find all pairs, then filter", it does "pick a candidate for `a`, then for `b`, then check right away if it works". ### Triangle example, step by step Query: "Find all triangles `(a)->(b), (b)->(c), (c)->(a)`" ``` for each candidate a: for each b such that a→b exists: for each c such that b→c AND c→a both exist: ← this is the key: we check c→a immediately, not later emit (a, b, c) ``` The critical difference: when considering candidates for `c`, LTJ already knows that `c` must appear in both `out(b)` (from pattern 2) and `in(a)` (from pattern 3). It **intersects** these two lists right now, instead of building a huge table and filtering later. If `b` has 100 outgoing edges and `a` has 5 incoming edges, the intersection produces at most 5 candidates for `c`, not 100. No blowup. ### How the intersection works: leapfrog Suppose we need to find values of `c` that appear in both list A (sorted) and list B (sorted): ``` List A (out(b)): [3, 7, 12, 15, 20, 25, 30] List B (in(a)): [5, 12, 18, 25, 40] ``` The leapfrog algorithm alternates between the two lists: ``` 1. Start: A seeks ≥ 0 → finds 3. B seeks ≥ 3 → finds 5. (3 ≠ 5) 2. Rotate: A seeks ≥ 5 → finds 7. B seeks ≥ 7 → finds 12. (7 ≠ 12) 3. Rotate: A seeks ≥ 12 → finds 12. B seeks ≥ 12 → finds 12. (12 = 12) ← match! 4. Next: A seeks ≥ 13 → finds 15. B seeks ≥ 15 → finds 18. (15 ≠ 18) 5. Rotate: A seeks ≥ 18 → finds 20. B seeks ≥ 20 → finds 25. (20 ≠ 25) 6. Rotate: A seeks ≥ 25 → finds 25. B seeks ≥ 25 → finds 25. (25 = 25) ← match! 7. Next: A seeks ≥ 26 → finds 30. B seeks ≥ 30 → finds 40. (30 ≠ 40) 8. Rotate: A seeks ≥ 40 → end of A. Done. ``` Result: `c ∈ {12, 25}`. We found the intersection without scanning every element. Each "seek" is a binary search: O(log n). With 3 or more lists, the algorithm rotates through all of them, always seeking forward to at least the current maximum. It converges quickly because each seek jumps ahead, skipping large portions of each list. ## 4. From graph queries to triples LTJ was originally designed for RDF databases, where data is stored as `(subject, predicate, object)` triples. We adapt it to property graphs by treating each directed edge as a triple: ``` Edge: Alice ──follows──► Bob Triple: (Alice, follows, Bob) Edge: p1 ──Transfer──► p2 Triple: (p1, Transfer, p2) ``` In general: `(source_node, edge_label, target_node)`. ### Query transformation The runtime transforms GQL patterns into triple patterns. Each variable in the query becomes a variable in a triple, and each concrete label becomes a constant. **Example: `(a)-[:Transfer]->(b), (b)-[:Transfer]->(c)`** ``` Triple 1: (a, Transfer, b) ← "a" sends a Transfer to "b" Triple 2: (b, Transfer, c) ← "b" sends a Transfer to "c" ``` Variable `b` is shared between both triples. LTJ will bind `b` to a value that satisfies both. **Example: `(x)->(y)->(z)` (no label, concat chain)** A concatenation of edges is also a join. The middle node `y` must be the target of the first edge and the source of the second. This is the same as a comma-join on `y`: ``` Triple 1: (x, _p0, y) ← some edge from x to y (any label) Triple 2: (y, _p1, z) ← some edge from y to z (any label) ``` Variables `_p0` and `_p1` are fresh internal variables for the unknown labels. LTJ treats them as additional variables to bind, which works naturally: it just means "find any label that fits". **Example: anonymous nodes `()-[]->()-[]->()`** Nodes without variable names get fresh internal names (`_ltj_0`, etc.): ``` Triple 1: (_ltj_0, _ltj_2, _ltj_1) Triple 2: (_ltj_1, _ltj_3, _ltj_4) ``` The internal names are excluded from the final result. ## 5. The 6 sorted orderings LTJ needs to search triples efficiently by any combination of components. Sometimes it knows the source and wants the target; sometimes it knows the target and wants the source; sometimes it knows the label and wants both endpoints. To support this, we store 6 copies of all triples, each sorted by a different permutation: ``` Name Sorted by Used when you know... ──── ───────── ────────────────────── SPO (src, label, tgt) src, then label → find targets SOP (src, tgt, label) src, then tgt → find labels POS (label, tgt, src) label, then tgt → find sources PSO (label, src, tgt) label, then src → find targets OSP (tgt, src, label) tgt, then src → find labels OPS (tgt, label, src) tgt, then label → find sources ``` Think of each ordering as a phone book sorted differently. If you know the last name, the standard phone book works. If you know the city first, you need a different sorting. With 6 orderings, any prefix query is a simple binary search. ### Concrete example (fraud graph, 5 edges) ``` Edges: p1──Transfer──►p2, p2──Transfer──►a2, a2──Transfer──►a1, a1──Transfer──►p1, a1──Foo──►d1 Using internal IDs: p1=2, p2=3, a2=1, a1=0, d1=4 Transfer=0, Foo=1 SPO ordering (sorted by src, then label, then tgt): (0, 0, 2) ← a1 ──Transfer──► p1 (0, 1, 4) ← a1 ──Foo──► d1 (1, 0, 0) ← a2 ──Transfer──► a1 (2, 0, 3) ← p1 ──Transfer──► p2 (3, 0, 1) ← p2 ──Transfer──► a2 ``` To find "all targets of Transfer edges from node 0 (a1)": binary search for prefix (0, 0, ?) → finds (0, 0, 2) → target is p1. To find "all sources that reach node 0 (a1)", we need the OSP ordering (sorted by tgt first): binary search for prefix (0, ?, ?) → finds all edges ending at a1. ## 6. The algorithm in action: full worked example Let's trace LTJ on a triangle query over a small graph. **Graph** (directed edges only): ``` 0 → 1, 0 → 2, 1 → 2, 1 → 0, 2 → 0, 2 → 1 ``` **Query**: `(a)->(b), (b)->(c), (c)->(a)` — find all triangles. **Triples** (labels omitted for simplicity): ``` Triple 1: (a, *, b) ← a→b Triple 2: (b, *, c) ← b→c Triple 3: (c, *, a) ← c→a ``` **Variable order (VEO)**: a, b, c (all non-lonely). **Search**: ``` seek(a, ≥0): Triple 1 needs a at position S → leap in SPO: first src ≥ 0 → a=0 ✓ Triple 3 needs a at position O → leap in OSP: first tgt ≥ 0 → a=0 ✓ Both agree: a=0 down(a=0) in all iterators containing a seek(b, ≥0): Triple 1 needs b at position O, with a=0 fixed at S → use SOP ordering, range narrowed to src=0: targets are {1, 2} → leap ≥ 0 → b=1 ✓ Triple 2 needs b at position S → use SPO ordering: first src ≥ 1 → b=1 ✓ Both agree: b=1 down(b=1) in all iterators containing b seek(c, ≥0): Triple 2 needs c at position O, with b=1 fixed at S → targets of node 1: {0, 2} → leap ≥ 0 → c=0 ✓ Triple 3 needs c at position S, with a=0 fixed at O → sources reaching node 0: {1, 2} → leap ≥ 0 → c=1 (0 is not in {1,2}) Mismatch! c=0 vs c=1. Rotate: Triple 2 seeks ≥ 1 → c=2 Triple 3 seeks ≥ 2 → c=2 ✓ Both agree: c=2 EMIT (a=0, b=1, c=2) ← triangle found! seek(c, ≥3): both lists exhausted → no more c up(b) seek(b, ≥2): Triple 1: targets of 0 ≥ 2 → b=2 Triple 2: sources ≥ 2 → b=2 Both agree: b=2 down(b=2) seek(c, ≥0): Triple 2: targets of 2 → {0, 1}. leap ≥ 0 → c=0 Triple 3: sources to 0 → {1, 2}. leap ≥ 0 → c=1 Mismatch. Rotate: Triple 2 seeks ≥ 1 → c=1. Triple 3 seeks ≥ 1 → c=1. ✓ Both agree: c=1 EMIT (a=0, b=2, c=1) ← another triangle! up(b) seek(b, ≥3): exhausted up(a) seek(a, ≥1): ... continues for a=1, a=2, finding all triangles ``` Each "seek" is a binary search in a sorted array. No intermediate tables. No blowup. ## 7. When LTJ activates and when it doesn't The runtime tries LTJ automatically whenever it encounters a join or concatenation of directed edges. If the pattern contains elements that can't be decomposed into triples, it falls back to the original pairwise hash-join. **LTJ handles:** - Comma-joins: `(a)->(b), (b)->(c), (c)->(a)` - Long chains: `(a)->(b)->(c)->(d)->(e)` - Combinations: `(a)->(b)->(c), (c)->(d), (a)->(d)` - Labeled edges: `(a)-[:Transfer]->(b)` (label becomes a constant) - Unlabeled edges: `(a)->(b)` (label becomes a free variable) - Anonymous nodes: `()-[]->()-[]->()` (fresh internal variables) - Left-direction edges: `(a)<-[e]-(b)` — the extractor swaps endpoints so the emitted triple is `(b, p, a)` (forward). - Undirected edges: `(a)~[e]~(b)` — the `TripleIndex` stores each undirected edge as both `(s,p,t)` and `(t,p,s)` with the same `edge_id`, so a forward lookup catches it from either endpoint. Result reconstruction recovers the `EdgeUndirectional` variant via `graph.edge_path_value(eid)`. **LTJ does NOT handle (falls back to hash-join):** - Any-direction edges (no tilde): `(a)-[e]-(b)` — distinct from undirected; matches both directed senses without the schema guarantee of an undirected edge. - Unions: `(a)->(b) | (a)->(c)` - Repetitions: `(a)->(b){1,3}` - WHERE expressions with property comparisons (labels ARE handled) The fallback is safe: results are identical, only performance differs. ### Repetitions: a future opportunity A pattern like `((a)-[:T]->(b)){1,2}` means "1 or 2 consecutive T-edges". This could be unrolled into separate LTJ executions: ``` k=1: one triple (a, T, b) k=2: two triples (a, T, _m), (_m, T, b) ``` Each execution runs without intermediate blowup. The results are unioned, and the edge variables are collected into `PathValue::List` as the runtime expects. This is not implemented yet but would extend LTJ to handle fixed repetitions. ## 8. Benchmark results: before and after All measurements on soc-LiveJournal1 (100K edges), limit=1000 results. | Query shape | Hash-join (before) | LTJ (after) | Speedup | |----------------|-------------------|-------------|---------| | 1-tree | 0.28s | 0.039s | 7x | | 2-comb | 0.40s | 0.039s | 10x | | 3-clique | 0.57s | 0.041s | 14x | | 3-cycle | 0.57s | 0.041s | 14x | | 2-3-lollipop | 1.56s | 0.040s | 39x | | 4-cycle | 4.15s | 0.041s | 101x | | 3-path | 6.33s | 0.040s | 158x | | 3-4-lollipop | 23.1s | 0.040s | 577x | | 2-tree | 101.8s | 0.039s | 2610x | | 4-path | 159.8s | 0.039s | 4097x | | **4-clique** | **never finished** | **0.043s** | **∞** | The speedup grows with query complexity because LTJ avoids exponential intermediate blowup. Simple queries (1-tree, 2-comb) see modest gains because the old approach didn't blow up on them. ## 9. Implementation overview The implementation lives in `src/runtime/ltj/` with 6 files: ``` triple_index.rs Build 6 sorted orderings from the graph. Provides leap() and range_for_key() via binary search. iterator.rs One iterator per triple pattern. Navigates the right ordering based on which variables are already bound. Handles constant labels by pre-fixing them in the stack. veo.rs Decides the order to bind variables. Non-lonely variables (shared between triples) go first. Currently static. algorithm.rs The recursive search: bind variable j, seek intersection, descend, recurse for j+1, backtrack. Evaluates filters (node labels) at each level before descending. pattern_extract.rs Transforms a PathPattern tree into flat triple patterns. Flattens Concat chains, extracts Join children, creates fresh variables for anonymous nodes. Entry point: try_ltj(). engine.rs Integration: run_join() and run_concat_pattern() call (modified) try_ltj() first. If it returns None, fall back to the original strategy. ``` ## 10. Remaining improvements 1. **Adaptive VEO**: choose the next variable dynamically based on current cardinalities, not a fixed order. The CLTJ paper shows an order-of- magnitude improvement from this alone. 2. **Cache the TripleIndex**: currently rebuilt for each query. Should be built once per graph and reused. 3. **Unroll fixed repetitions**: `{1,m}` can be decomposed into m separate LTJ runs (see section 7). 4. **Compact tries**: replace sorted Vec with LOUDS bitvectors for space efficiency on very large graphs (billions of edges). 5. **Handle any-direction edges (`-[e]-`)**: **DONE** — pure *and* mixed any-direction chains and comma-joins (issue #71). The team adopted ISO bag semantics; the LTJ base case now fans out one row per physical edge at the bound `(s, L, o)` unconditionally (fixing the directed parallel-edge collapse too), and a mirrored index (`from_graph_anydir`) stores every edge in both senses so a single forward triple matches either orientation in one LTJ pass. Verified against a hand-computed ISO oracle (`tests/{iso_multiplicity,anydir_iso}_test.rs`), behind `GQLITE_DISABLE_ANYDIR_LTJ`. Bounded unused-edge any-direction *repetitions* (`(a)-[]-{1,3}(b)`) also reach the mirror: `unroll_repeat` unrolls them into flat any-direction arms. Full rationale in `anydir-ltj-plan.md`. All any-direction paths (mixed LTJ, seeded repetition, adjacency fallback) are ISO bag-consistent — verified by `tests/anydir_path_consistency_test.rs`. **Mixed** directed+any-direction chains also run through LTJ now (`try_ltj_mixed`, per-triple index routing): each triple's iterator queries the plain or mirrored index by its `EdgeKind`, and the leapfrog joins across both (global node ids, shared label ids). Measured ~4.7× faster / ~2× less RSS than the fallback on a fraud-DB mixed query. Nothing any-direction is left on the fallback except Unions / non-unrollable repetitions. ### Caveat: the Ring's bidirectionality ≠ undirected edges A recurring slip when reading the Ring / CLTJ papers: the ring lets you traverse the *attributes* `s/p/o` in either order (which you bind first in the variable elimination), so one compact structure serves all 6 orderings. That is **access-order** bidirectionality — it is *not* the same as an edge's semantic orientation. The stored triples stay directed: a directed edge `a→b` is the single fact `(a, L, b)`, reachable via OSP by binding `b` first, but the reverse *fact* `(b, L, a)` is never manufactured by re- ordering. Only genuinely undirected edges are stored in both senses (`triple_index.rs`, `push_both`). So the ring buys cheap reverse *access*, never any-direction *matching*; conflating the two is how `-[e]-` looks "already supported" when it is not.