# Using LaTeX to evaluate MATH capabilities

Parsing $\LaTeX$ is hard. This is an issue when evaluating a model that is expected to produce $\LaTeX$ output, as is the case for the [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH), which uses $\LaTeX$ to represent mathematical expressions and symbols. Evaluating this task should simply be a matter of parsing the ground truth and the model's output and comparing the two. It turns out there is no fully reliable way to parse $\LaTeX$:

*From the [`sympy`](https://github.com/sympy/sympy) documentation*

The LM Evaluation Harness uses [`sympy`](https://github.com/sympy/sympy) (a Python library for symbolic mathematics) to parse $\LaTeX$ and compare expressions. When we used `sympy` to parse the ground truths and compare each one against itself, we only got around 0.94 accuracy. How could that be? It turns out that `sympy` cannot parse certain perfectly valid $\LaTeX$ expressions. For example:

```
couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
```

```
couldn't parse one of (-\iny,-5]\cup[5,\iny) or (-\iny,-5]\cup[5,\iny), I expected something else here
(-\iny,-5]\cup[5,\iny)
~~~~~~^
```

```
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~^
```

### How do I get around this?

You could either rewrite the $\LaTeX$ [grammar](https://github.com/sympy/sympy/blob/master/sympy/parsing/latex/lark/grammar/latex.lark) to add the missing constructs, or add manual checks to your own evaluation code. After almost falling into a deep rabbit hole, we decided that adding string comparison checks to our code would be sufficient (see the sketch after the results table below).

*Fix to the LM Evaluation Harness*

### Results

Here is a table comparing the old and new results for the first 25 models.
*Comparison of the original and fixed parser on the MATH benchmark*

| Model | Score (original) | Score (fixed parser) | Rank (original) | Rank (fixed parser) |
|---|---|---|---|---|
| rombodawg/Rombos-LLM-V2.5-Qwen-72b | 47.58 | 50.68 | 1 | 1 |
| MaziyarPanahi/calme-2.2-qwen2-72b | 41.16 | 43.43 | 2 | 2 |
| arcee-ai/Arcee-Nova | 40.48 | 42.90 | 3 | 3 |
| fblgit/TheBeagle-v2beta-32B-MGS | 39.43 | 42.52 | 4 | 4 |
| rombodawg/Rombos-LLM-V2.5-Qwen-32b | 39.12 | 41.99 | 5 | 5 |
| dnhkng/RYS-XLarge | 38.97 | 41.24 | 6 | 6 |
| dfurman/CalmeRys-78B-Orpo-v0.1 | 37.92 | 40.71 | 8 | 7 |
| MaziyarPanahi/calme-2.2-rys-78b | 37.92 | 39.95 | 8 | 9 |
| MaziyarPanahi/calme-2.4-rys-78b | 37.69 | 40.41 | 9 | 8 |
| MaziyarPanahi/calme-2.3-rys-78b | 36.56 | 38.97 | 10 | 10 |
| MaziyarPanahi/calme-2.1-rys-78b | 36.40 | 38.90 | 11 | 11 |
| Qwen/Qwen2.5-72B | 36.10 | 38.67 | 12 | 12 |
| MaziyarPanahi/calme-2.1-qwen2-72b | 36.03 | 38.07 | 13 | 15 |
| Qwen/Qwen2-Math-72B-Instruct | 35.95 | 38.14 | 14 | 14 |
| dfurman/Qwen2-72B-Orpo-v0.1 | 35.42 | 38.14 | 15 | 13 |
| abacusai/Smaug-Qwen2-72B-Instruct | 35.35 | 37.46 | 16 | 19 |
| anthracite-org/magnum-v1-72b | 35.27 | 37.69 | 18 | 16 |
| alpindale/magnum-72b-v1 | 35.27 | 37.69 | 18 | 16 |
| Qwen/Qwen2-72B-Instruct | 35.12 | 37.69 | 19 | 18 |
| dnhkng/RYS-XLarge-base | 34.67 | 37.16 | 20 | 20 |
| Undi95/MG-FinalMix-72B | 33.61 | 36.10 | 22 | 21 |
| abacusai/Dracarys-72B-Instruct | 33.61 | 35.65 | 22 | 22 |
| Qwen/Qwen2.5-32B | 32.85 | 35.50 | 23 | 23 |
| anthracite-org/magnum-v2-72b | 31.65 | 34.06 | 24 | 24 |
| dnhkng/RYS-Huge-bnb-4bit | 31.57 | 33.84 | 25 | 25 |
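
For illustration, here is a minimal sketch of the fallback idea described above: compare the raw strings first, and only rely on `sympy` when the strings differ, treating unparsable $\LaTeX$ as a non-match. This is not the exact patch applied to the LM Evaluation Harness; the `is_equiv` helper and its structure are just for this example.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex


def is_equiv(ground_truth: str, prediction: str) -> bool:
    """Judge whether two LaTeX answers are equivalent."""
    # String comparison first: identical strings are always equivalent,
    # even when sympy cannot parse them (e.g. interval notation like "[0,1)").
    if ground_truth.strip() == prediction.strip():
        return True
    try:
        # Symbolic comparison: two expressions are equivalent
        # if their difference simplifies to zero.
        diff = parse_latex(ground_truth) - parse_latex(prediction)
        return simplify(diff) == 0
    except Exception:
        # One of the expressions could not be parsed or compared;
        # without the string match above, we have to count it as a miss.
        return False
```

Note that `sympy.parsing.latex.parse_latex` needs the `antlr4-python3-runtime` package installed. With the string check in place, a ground truth compared against itself always matches, so parsing failures no longer show up as false negatives.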