{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train spaCy Model for Marathi (mr)\n", "\n", "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)\n", "\n", "Use the [config widget](https://spacy.io/usage/training) on spaCy's website to generate a `base_config.cfg` configuration and paste it into the `base_config.cfg` file in this folder.\n", "\n", "Run the following command to create a full config:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy init fill-config ./base_config.cfg ./config.cfg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go to the [Universal Dependencies website](https://universaldependencies.org/) and download the [UD-Marathi-UFAL](https://github.com/UniversalDependencies/UD_Marathi-UFAL) data (train and dev). Convert the CoNLL files to the spaCy binary format using the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy convert ./mr_ufal-ud-dev.conllu ./dev.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy convert ./mr_ufal-ud-train.conllu ./train.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the training for Marathi:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load(\"./output\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = u\"त्याला एक मुलगा होता.\"\n", "doc = nlp(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for token in doc:\n", " print(\"\\t\".join( (token.text, str(token.idx), token.lemma_, token.pos_, token.tag_, token.dep_,\n", " token.shape_, str(token.is_alpha), str(token.is_stop) )))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 2 }