{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train spaCy Model for Marathi (mr)\n",
    "\n",
    "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)\n",
    "\n",
    "Use the [config widget](https://spacy.io/usage/training) on spaCy's website to generate a `base_config.cfg` configuration and paste it into the `base_config.cfg` file in this folder.\n",
    "\n",
    "Run the following command to create a full config:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m spacy init fill-config ./base_config.cfg ./config.cfg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go to the [Universal Dependencies website](https://universaldependencies.org/) and download the [UD-Marathi-UFAL](https://github.com/UniversalDependencies/UD_Marathi-UFAL) data (train and dev). Convert the CoNLL files to the spaCy binary format using the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m spacy convert ./mr_ufal-ud-dev.conllu ./dev.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m spacy convert ./mr_ufal-ud-train.conllu ./train.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the training for Marathi:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load(\"./output\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = u\"त्याला एक मुलगा होता.\"\n",
    "doc = nlp(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for token in doc:\n",
    "    print(\"\\t\".join( (token.text, str(token.idx), token.lemma_, token.pos_, token.tag_, token.dep_,\n",
    "          token.shape_, str(token.is_alpha), str(token.is_stop) )))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}