# Using LaTeX to evaluate MATH capabilities

Parsing $\LaTeX$ is hard. This is an issue when evaluating a model that is expected to output $\LaTeX$, as is the case for the [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH), which uses $\LaTeX$ to represent mathematical expressions and symbols. Evaluating this task should be a simple matter of parsing the ground truth and the model's output and comparing the two. It turns out there is no right way to parse $\LaTeX$:

![](../../assets/sympy_doc.png)

*From the [`sympy`](https://github.com/sympy/sympy) documentation*

The LM Evaluation Harness uses [`sympy`](https://github.com/sympy/sympy) (a Python library for symbolic mathematics) to parse $\LaTeX$ and compare expressions. When we use `sympy` to parse the ground truths and compare each one against itself, we only get around 0.94 accuracy. How could that be? It turns out `sympy` cannot parse certain perfectly valid $\LaTeX$ expressions. For example:

```
couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
```

```
couldn't parse one of (-\iny,-5]\cup[5,\iny) or (-\iny,-5]\cup[5,\iny), I expected something else here
(-\iny,-5]\cup[5,\iny)
~~~~~~^
```

```
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~^
```

### How do I get around this?

You could either rewrite the $\LaTeX$ [grammar](https://github.com/sympy/sympy/blob/master/sympy/parsing/latex/lark/grammar/latex.lark), adding the missing constructs to the parser, or add manual checks to your code to improve model scores. After almost falling into a deep rabbit hole, we decided that adding string comparison checks to our code would be sufficient.

![Fix to the LM Evaluation Harness](../../assets/lm_eval_diff.png)

*Fix to the LM Evaluation Harness*
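In spirit, the fix boils down to something like the sketch below. This is a simplified illustration, not the exact patch from the diff above; the `normalize` and `is_equivalent` helpers are hypothetical names used for this example. The idea is to compare the two answers as normalized strings first, and only rely on `sympy`'s $\LaTeX$ parser when the strings differ.

```python
# Minimal sketch of the string-comparison fallback (illustrative, not the
# exact lm-evaluation-harness patch). Requires `sympy` and, for parse_latex,
# the antlr4-python3-runtime package.
from sympy import simplify
from sympy.parsing.latex import parse_latex


def normalize(answer: str) -> str:
    """Cheap normalization: drop math delimiters and whitespace."""
    return answer.strip().strip("$").replace(" ", "")


def is_equivalent(prediction: str, ground_truth: str) -> bool:
    # 1. String comparison first: this catches answers sympy cannot parse,
    #    such as the half-open interval "[0,1)" above.
    if normalize(prediction) == normalize(ground_truth):
        return True

    # 2. Symbolic comparison: this catches answers that are written
    #    differently but are mathematically equal, e.g. "\frac{1}{2}" vs "0.5".
    try:
        difference = parse_latex(prediction) - parse_latex(ground_truth)
        return simplify(difference) == 0
    except Exception:
        # Parsing failed (as in the error messages above): count as not equal.
        return False


print(is_equivalent(r"[0,1)", r"[0,1)"))      # True, via the string check
print(is_equivalent(r"\frac{1}{2}", r"0.5"))  # True, via sympy
```

An exact string match can only turn parse failures into correct answers; it never declares two genuinely different answers equal, so it is a safe check to run before the fallible parser.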
### Results

Here is a table comparing old and new results of the first 25 models.

| Model | Score (original) | Score (fixed parser) | Rank (original) | Rank (fixed parser) |
|---|---|---|---|---|
| rombodawg/Rombos-LLM-V2.5-Qwen-72b | 47.58 | 50.68 | 1 | 1 |
| MaziyarPanahi/calme-2.2-qwen2-72b | 41.16 | 43.43 | 2 | 2 |
| arcee-ai/Arcee-Nova | 40.48 | 42.90 | 3 | 3 |
| fblgit/TheBeagle-v2beta-32B-MGS | 39.43 | 42.52 | 4 | 4 |
| rombodawg/Rombos-LLM-V2.5-Qwen-32b | 39.12 | 41.99 | 5 | 5 |
| dnhkng/RYS-XLarge | 38.97 | 41.24 | 6 | 6 |
| dfurman/CalmeRys-78B-Orpo-v0.1 | 37.92 | 40.71 | 8 | 7 |
| MaziyarPanahi/calme-2.2-rys-78b | 37.92 | 39.95 | 8 | 9 |
| MaziyarPanahi/calme-2.4-rys-78b | 37.69 | 40.41 | 9 | 8 |
| MaziyarPanahi/calme-2.3-rys-78b | 36.56 | 38.97 | 10 | 10 |
| MaziyarPanahi/calme-2.1-rys-78b | 36.40 | 38.90 | 11 | 11 |
| Qwen/Qwen2.5-72B | 36.10 | 38.67 | 12 | 12 |
| MaziyarPanahi/calme-2.1-qwen2-72b | 36.03 | 38.07 | 13 | 15 |
| Qwen/Qwen2-Math-72B-Instruct | 35.95 | 38.14 | 14 | 14 |
| dfurman/Qwen2-72B-Orpo-v0.1 | 35.42 | 38.14 | 15 | 13 |
| abacusai/Smaug-Qwen2-72B-Instruct | 35.35 | 37.46 | 16 | 19 |
| anthracite-org/magnum-v1-72b | 35.27 | 37.69 | 18 | 16 |
| alpindale/magnum-72b-v1 | 35.27 | 37.69 | 18 | 16 |
| Qwen/Qwen2-72B-Instruct | 35.12 | 37.69 | 19 | 18 |
| dnhkng/RYS-XLarge-base | 34.67 | 37.16 | 20 | 20 |
| Undi95/MG-FinalMix-72B | 33.61 | 36.10 | 22 | 21 |
| abacusai/Dracarys-72B-Instruct | 33.61 | 35.65 | 22 | 22 |
| Qwen/Qwen2.5-32B | 32.85 | 35.50 | 23 | 23 |
| anthracite-org/magnum-v2-72b | 31.65 | 34.06 | 24 | 24 |
| dnhkng/RYS-Huge-bnb-4bit | 31.57 | 33.84 | 25 | 25 |

*Comparison of original and fixed parser on the MATH benchmark*