{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🚀 100 Times Faster Natural Language Processing in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This iPython notebook contains the examples detailed in my post [🚀 100 Times Faster Natural Language Processing in Python](https://medium.com/huggingface/100-times-faster-natural-language-processing-in-python-ee32033bdced).\n", "\n", "To run the notebook, you will first need to:\n", "- [install Cython](http://cython.readthedocs.io/en/latest/src/quickstart/install.html), e.g. ```pip install cython```\n", "- [install spaCy](https://spacy.io/usage/), e.g. ```pip install spacy```\n", "- [download a language model for spaCy](https://spacy.io/usage/models), e.g. ```python -m spacy download en```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cython then has to be activated in the notebook as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:24:46.507199Z", "start_time": "2018-06-12T00:24:46.106511Z" } }, "outputs": [], "source": [ "%load_ext Cython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fast loops in Python with a bit of Cython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Rectangles](https://cdn-images-1.medium.com/max/800/0*RA89oQ-0j3Rscipw.jpg \"Rectangles\")\n", "\n", "In this simple example we have a large set of rectangles that we store as a list of Python objects, e.g. instances of a Rectangle class. 
The main job of our module is to iterate over this list in order to count how many rectangles have an area larger than a specific threshold.\n", "\n", "Our Python module is quite simple and looks like this (see also here: https://gist.github.com/thomwolf/0709b5a72cf3620cd00d94791213d38e):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:24:59.023441Z", "start_time": "2018-06-12T00:24:59.008643Z" } }, "outputs": [], "source": [ "from random import random\n", "\n", "class Rectangle:\n", "    def __init__(self, w, h):\n", "        self.w = w\n", "        self.h = h\n", "    def area(self):\n", "        return self.w * self.h\n", "\n", "def check_rectangles_py(rectangles, threshold):\n", "    n_out = 0\n", "    for rectangle in rectangles:\n", "        if rectangle.area() > threshold:\n", "            n_out += 1\n", "    return n_out\n", "\n", "def main_rectangles_slow():\n", "    n_rectangles = 10000000\n", "    rectangles = list(Rectangle(random(), random()) for i in range(n_rectangles))\n", "    n_out = check_rectangles_py(rectangles, threshold=0.25)\n", "    print(n_out)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:25:15.065933Z", "start_time": "2018-06-12T00:25:00.216377Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4034940\n", "CPU times: user 13.3 s, sys: 1.48 s, total: 14.8 s\n", "Wall time: 14.8 s\n" ] } ], "source": [ "%%time\n", "# Let's run it:\n", "main_rectangles_slow()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```check_rectangles_py``` function, which loops over a large number of Python objects, is our bottleneck!\n", "\n", "Let's write it in Cython.\n", "\n", "We indicate that the cell is a Cython cell by using the ```%%cython``` magic command. When the cell is run, the Cython code is written to a temporary file, compiled and reimported into the iPython namespace. The Cython code therefore has to be self-contained." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:25:22.049894Z", "start_time": "2018-06-12T00:25:22.030225Z" } }, "outputs": [], "source": [ "%%cython\n", "from cymem.cymem cimport Pool\n", "from random import random\n", "\n", "cdef struct Rectangle:\n", "    float w\n", "    float h\n", "\n", "cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):\n", "    cdef int n_out = 0\n", "    # C arrays contain no size information => we need to state it explicitly\n", "    for rectangle in rectangles[:n_rectangles]:\n", "        if rectangle.w * rectangle.h > threshold:\n", "            n_out += 1\n", "    return n_out\n", "\n", "def main_rectangles_fast():\n", "    cdef int n_rectangles = 10000000\n", "    cdef float threshold = 0.25\n", "    cdef Pool mem = Pool()\n", "    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))\n", "    for i in range(n_rectangles):\n", "        rectangles[i].w = random()\n", "        rectangles[i].h = random()\n", "    n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)\n", "    print(n_out)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:25:27.697598Z", "start_time": "2018-06-12T00:25:26.969479Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4036387\n", "CPU times: user 676 ms, sys: 40.8 ms, total: 717 ms\n", "Wall time: 715 ms\n" ] } ], "source": [ "%%time\n", "main_rectangles_fast()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this simple case we are about 20 times faster in Cython.\n", "\n", "The ratio of improvement depends a lot on how the Python program is written.\n", "\n", "While the speed in Cython is rather predictable once your code only makes use of C level objects (it is usually directly the fastest possible speed), the speed of Python can vary a lot depending on how your program is written and how much overhead the interpreter will add." 
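, "\n", "To get a feel for how much the pure-Python side can vary, here is a minimal sketch (not part of the original post) that times two equivalent ways of writing the counting loop. The function names `count_with_method_calls` and `count_with_attribute_access` and the list size are assumptions made for illustration, and the actual timings depend on your interpreter version and machine.\n", "\n", "
```python
from random import random
from timeit import timeit

class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h
    def area(self):
        return self.w * self.h

def count_with_method_calls(rectangles, threshold):
    # Same loop as check_rectangles_py: one method call per rectangle
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out

def count_with_attribute_access(rectangles, threshold):
    # Hypothetical variant: inline the area computation and let sum() drive the loop
    return sum(1 for r in rectangles if r.w * r.h > threshold)

# A smaller list than in the notebook (an arbitrary size chosen to keep the run short)
rectangles = [Rectangle(random(), random()) for _ in range(100000)]

# Both variants must agree on the count; only their run time can differ
assert count_with_method_calls(rectangles, 0.25) == count_with_attribute_access(rectangles, 0.25)

for fn in (count_with_method_calls, count_with_attribute_access):
    # Time 5 full passes over the list for each variant
    seconds = timeit(lambda: fn(rectangles, 0.25), number=5)
    print(fn.__name__, round(seconds, 3))
```
\n", "\n", "Both functions return the same count, so any difference in the printed timings comes only from how the loop is written."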
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can you be sure your Cython program only makes use of C level structures?\n", "\n", "Use the ```-a``` or ```--annotate``` flag in the ```%%cython``` magic command to display a code analysis in which the lines that access and use Python objects are highlighted in yellow.\n", "\n", "Here is how the code analysis of the previous program looks:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:25:54.571781Z", "start_time": "2018-06-12T00:25:54.548493Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " Cython: _cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce.pyx\n", " \n", "\n", "\n", "

Generated by Cython 0.28.3

\n", "

\n", " Yellow lines hint at Python interaction.
\n", " Click on a line that starts with a \"+\" to see the C code that Cython generated for it.\n", "

\n", "
 01: from cymem.cymem cimport Pool
\n", "
+02: from random import random
\n", "
  __pyx_t_1 = PyList_New(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __Pyx_INCREF(__pyx_n_s_random);\n",
       "  __Pyx_GIVEREF(__pyx_n_s_random);\n",
       "  PyList_SET_ITEM(__pyx_t_1, 0, __pyx_n_s_random);\n",
       "  __pyx_t_2 = __Pyx_Import(__pyx_n_s_random, __pyx_t_1, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_2);\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __pyx_t_1 = __Pyx_ImportFrom(__pyx_t_2, __pyx_n_s_random); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  if (PyDict_SetItem(__pyx_d, __pyx_n_s_random, __pyx_t_1) < 0) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;\n",
       "
 03: 
\n", "
+04: cdef struct Rectangle:
\n", "
struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle {\n",
       "  float w;\n",
       "  float h;\n",
       "};\n",
       "
 05:     float w
\n", "
 06:     float h
\n", "
 07: 
\n", "
+08: cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):
\n", "
static int __pyx_f_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_check_rectangles_cy(struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle *__pyx_v_rectangles, int __pyx_v_n_rectangles, float __pyx_v_threshold) {\n",
       "  int __pyx_v_n_out;\n",
       "  struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle __pyx_v_rectangle;\n",
       "  int __pyx_r;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"check_rectangles_cy\", 0);\n",
       "/* … */\n",
       "  /* function exit code */\n",
       "  __pyx_L0:;\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "
+09:     cdef int n_out = 0
\n", "
  __pyx_v_n_out = 0;\n",
       "
 10:     # C arrays contain no size information => we need to state it explicitly
\n", "
+11:     for rectangle in rectangles[:n_rectangles]:
\n", "
  __pyx_t_2 = (__pyx_v_rectangles + __pyx_v_n_rectangles);\n",
       "  for (__pyx_t_3 = __pyx_v_rectangles; __pyx_t_3 < __pyx_t_2; __pyx_t_3++) {\n",
       "    __pyx_t_1 = __pyx_t_3;\n",
       "    __pyx_v_rectangle = (__pyx_t_1[0]);\n",
       "
+12:         if rectangle.w * rectangle.h > threshold:
\n", "
    __pyx_t_4 = (((__pyx_v_rectangle.w * __pyx_v_rectangle.h) > __pyx_v_threshold) != 0);\n",
       "    if (__pyx_t_4) {\n",
       "/* … */\n",
       "    }\n",
       "  }\n",
       "
+13:             n_out += 1
\n", "
      __pyx_v_n_out = (__pyx_v_n_out + 1);\n",
       "
+14:     return n_out
\n", "
  __pyx_r = __pyx_v_n_out;\n",
       "  goto __pyx_L0;\n",
       "
 15: 
\n", "
+16: cpdef main_rectangles_fast():
\n", "
static PyObject *__pyx_pw_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_1main_rectangles_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/\n",
       "static PyObject *__pyx_f_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_main_rectangles_fast(CYTHON_UNUSED int __pyx_skip_dispatch) {\n",
       "  int __pyx_v_n_rectangles;\n",
       "  float __pyx_v_threshold;\n",
       "  struct __pyx_obj_5cymem_5cymem_Pool *__pyx_v_mem = 0;\n",
       "  struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle *__pyx_v_rectangles;\n",
       "  int __pyx_v_i;\n",
       "  int __pyx_v_n_out;\n",
       "  PyObject *__pyx_r = NULL;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_fast\", 0);\n",
       "/* … */\n",
       "  /* function exit code */\n",
       "  __pyx_r = Py_None; __Pyx_INCREF(Py_None);\n",
       "  goto __pyx_L0;\n",
       "  __pyx_L1_error:;\n",
       "  __Pyx_XDECREF(__pyx_t_1);\n",
       "  __Pyx_XDECREF(__pyx_t_6);\n",
       "  __Pyx_XDECREF(__pyx_t_7);\n",
       "  __Pyx_AddTraceback(\"_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce.main_rectangles_fast\", __pyx_clineno, __pyx_lineno, __pyx_filename);\n",
       "  __pyx_r = 0;\n",
       "  __pyx_L0:;\n",
       "  __Pyx_XDECREF((PyObject *)__pyx_v_mem);\n",
       "  __Pyx_XGIVEREF(__pyx_r);\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "\n",
       "/* Python wrapper */\n",
       "static PyObject *__pyx_pw_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_1main_rectangles_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/\n",
       "static PyObject *__pyx_pw_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_1main_rectangles_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused) {\n",
       "  PyObject *__pyx_r = 0;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_fast (wrapper)\", 0);\n",
       "  __pyx_r = __pyx_pf_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_main_rectangles_fast(__pyx_self);\n",
       "\n",
       "  /* function exit code */\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "\n",
       "static PyObject *__pyx_pf_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_main_rectangles_fast(CYTHON_UNUSED PyObject *__pyx_self) {\n",
       "  PyObject *__pyx_r = NULL;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_fast\", 0);\n",
       "  __Pyx_XDECREF(__pyx_r);\n",
       "  __pyx_t_1 = __pyx_f_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_main_rectangles_fast(0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 16, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_r = __pyx_t_1;\n",
       "  __pyx_t_1 = 0;\n",
       "  goto __pyx_L0;\n",
       "\n",
       "  /* function exit code */\n",
       "  __pyx_L1_error:;\n",
       "  __Pyx_XDECREF(__pyx_t_1);\n",
       "  __Pyx_AddTraceback(\"_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce.main_rectangles_fast\", __pyx_clineno, __pyx_lineno, __pyx_filename);\n",
       "  __pyx_r = NULL;\n",
       "  __pyx_L0:;\n",
       "  __Pyx_XGIVEREF(__pyx_r);\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "
+17:     cdef int n_rectangles = 10000000
\n", "
  __pyx_v_n_rectangles = 0x989680;\n",
       "
+18:     cdef float threshold = 0.25
\n", "
  __pyx_v_threshold = 0.25;\n",
       "
+19:     cdef Pool mem = Pool()
\n", "
  __pyx_t_1 = __Pyx_PyObject_CallNoArg(((PyObject *)__pyx_ptype_5cymem_5cymem_Pool)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_v_mem = ((struct __pyx_obj_5cymem_5cymem_Pool *)__pyx_t_1);\n",
       "  __pyx_t_1 = 0;\n",
       "
+20:     cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
\n", "
  __pyx_t_2 = ((struct __pyx_vtabstruct_5cymem_5cymem_Pool *)__pyx_v_mem->__pyx_vtab)->alloc(__pyx_v_mem, __pyx_v_n_rectangles, (sizeof(struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle))); if (unlikely(__pyx_t_2 == ((void *)NULL))) __PYX_ERR(0, 20, __pyx_L1_error)\n",
       "  __pyx_v_rectangles = ((struct __pyx_t_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_Rectangle *)__pyx_t_2);\n",
       "
+21:     for i in range(n_rectangles):
\n", "
  __pyx_t_3 = __pyx_v_n_rectangles;\n",
       "  __pyx_t_4 = __pyx_t_3;\n",
       "  for (__pyx_t_5 = 0; __pyx_t_5 < __pyx_t_4; __pyx_t_5+=1) {\n",
       "    __pyx_v_i = __pyx_t_5;\n",
       "
+22:         rectangles[i].w = random()
\n", "
    __pyx_t_6 = __Pyx_GetModuleGlobalName(__pyx_n_s_random); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 22, __pyx_L1_error)\n",
       "    __Pyx_GOTREF(__pyx_t_6);\n",
       "    __pyx_t_7 = NULL;\n",
       "    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_6))) {\n",
       "      __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_6);\n",
       "      if (likely(__pyx_t_7)) {\n",
       "        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_6);\n",
       "        __Pyx_INCREF(__pyx_t_7);\n",
       "        __Pyx_INCREF(function);\n",
       "        __Pyx_DECREF_SET(__pyx_t_6, function);\n",
       "      }\n",
       "    }\n",
       "    if (__pyx_t_7) {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 22, __pyx_L1_error)\n",
       "      __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;\n",
       "    } else {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallNoArg(__pyx_t_6); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 22, __pyx_L1_error)\n",
       "    }\n",
       "    __Pyx_GOTREF(__pyx_t_1);\n",
       "    __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "    __pyx_t_8 = __pyx_PyFloat_AsFloat(__pyx_t_1); if (unlikely((__pyx_t_8 == (float)-1) && PyErr_Occurred())) __PYX_ERR(0, 22, __pyx_L1_error)\n",
       "    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "    (__pyx_v_rectangles[__pyx_v_i]).w = __pyx_t_8;\n",
       "
+23:         rectangles[i].h = random()
\n", "
    __pyx_t_6 = __Pyx_GetModuleGlobalName(__pyx_n_s_random); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 23, __pyx_L1_error)\n",
       "    __Pyx_GOTREF(__pyx_t_6);\n",
       "    __pyx_t_7 = NULL;\n",
       "    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_6))) {\n",
       "      __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_6);\n",
       "      if (likely(__pyx_t_7)) {\n",
       "        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_6);\n",
       "        __Pyx_INCREF(__pyx_t_7);\n",
       "        __Pyx_INCREF(function);\n",
       "        __Pyx_DECREF_SET(__pyx_t_6, function);\n",
       "      }\n",
       "    }\n",
       "    if (__pyx_t_7) {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 23, __pyx_L1_error)\n",
       "      __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;\n",
       "    } else {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallNoArg(__pyx_t_6); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 23, __pyx_L1_error)\n",
       "    }\n",
       "    __Pyx_GOTREF(__pyx_t_1);\n",
       "    __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "    __pyx_t_8 = __pyx_PyFloat_AsFloat(__pyx_t_1); if (unlikely((__pyx_t_8 == (float)-1) && PyErr_Occurred())) __PYX_ERR(0, 23, __pyx_L1_error)\n",
       "    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "    (__pyx_v_rectangles[__pyx_v_i]).h = __pyx_t_8;\n",
       "  }\n",
       "
+24:     n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)
\n", "
  __pyx_v_n_out = __pyx_f_46_cython_magic_8305ca5d7d676d0e8a3d2abadd94b0ce_check_rectangles_cy(__pyx_v_rectangles, __pyx_v_n_rectangles, __pyx_v_threshold);\n",
       "
+25:     print(n_out)
\n", "
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_n_out); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_t_6 = __Pyx_PyObject_CallOneArg(__pyx_builtin_print, __pyx_t_1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_6);\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "
" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%cython -a\n", "from cymem.cymem cimport Pool\n", "from random import random\n", "\n", "cdef struct Rectangle:\n", " float w\n", " float h\n", "\n", "cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):\n", " cdef int n_out = 0\n", " # C arrays contain no size information => we need to state it explicitly\n", " for rectangle in rectangles[:n_rectangles]:\n", " if rectangle.w * rectangle.h > threshold:\n", " n_out += 1\n", " return n_out\n", "\n", "cpdef main_rectangles_fast():\n", " cdef int n_rectangles = 10000000\n", " cdef float threshold = 0.25\n", " cdef Pool mem = Pool()\n", " cdef Rectangle* rectangles = mem.alloc(n_rectangles, sizeof(Rectangle))\n", " for i in range(n_rectangles):\n", " rectangles[i].w = random()\n", " rectangles[i].h = random()\n", " n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)\n", " print(n_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The important element here is that lines 11 to 13 are not highlighted which means they will be running at the fastest possible speed.\n", "\n", "It's ok to have yellow lines in the ```main_rectangle_fast``` function as this function will only be called once when we execute our program anyway. The yellow lines 22 and 23 are initialization lines that we could avoid by using a C level random function like `stdlib rand()` but we didn't want to clutter this example.\n", "\n", "Now here is an example of the previous cython program not optimized (with Python objects in the loop):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:26:01.203499Z", "start_time": "2018-06-12T00:26:01.186246Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " Cython: _cython_magic_dbc2c06a712520185e24b7d477e83d8b.pyx\n", " \n", "\n", "\n", "

Generated by Cython 0.28.3

\n", "

\n", " Yellow lines hint at Python interaction.
\n", " Click on a line that starts with a \"+\" to see the C code that Cython generated for it.\n", "

\n", "
 01: from cymem.cymem cimport Pool
\n", "
+02: from random import random
\n", "
  __pyx_t_1 = PyList_New(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __Pyx_INCREF(__pyx_n_s_random);\n",
       "  __Pyx_GIVEREF(__pyx_n_s_random);\n",
       "  PyList_SET_ITEM(__pyx_t_1, 0, __pyx_n_s_random);\n",
       "  __pyx_t_2 = __Pyx_Import(__pyx_n_s_random, __pyx_t_1, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_2);\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __pyx_t_1 = __Pyx_ImportFrom(__pyx_t_2, __pyx_n_s_random); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  if (PyDict_SetItem(__pyx_d, __pyx_n_s_random, __pyx_t_1) < 0) __PYX_ERR(0, 2, __pyx_L1_error)\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;\n",
       "
 03: 
\n", "
+04: cdef struct Rectangle:
\n", "
struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle {\n",
       "  float w;\n",
       "  float h;\n",
       "};\n",
       "
 05:     float w
\n", "
 06:     float h
\n", "
 07: 
\n", "
+08: cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):
\n", "
static int __pyx_f_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_check_rectangles_cy(struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle *__pyx_v_rectangles, int __pyx_v_n_rectangles, float __pyx_v_threshold) {\n",
       "  PyObject *__pyx_v_n_out = NULL;\n",
       "  struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle __pyx_v_rectangle;\n",
       "  int __pyx_r;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"check_rectangles_cy\", 0);\n",
       "/* … */\n",
       "  /* function exit code */\n",
       "  __pyx_L1_error:;\n",
       "  __Pyx_XDECREF(__pyx_t_5);\n",
       "  __Pyx_WriteUnraisable(\"_cython_magic_dbc2c06a712520185e24b7d477e83d8b.check_rectangles_cy\", __pyx_clineno, __pyx_lineno, __pyx_filename, 1, 0);\n",
       "  __pyx_r = 0;\n",
       "  __pyx_L0:;\n",
       "  __Pyx_XDECREF(__pyx_v_n_out);\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "
 09:     # ========== MODIFICATION ===========
\n", "
 10:     # We changed the following line from `cdef int n_out = 0` to
\n", "
+11:     n_out = 0
\n", "
  __Pyx_INCREF(__pyx_int_0);\n",
       "  __pyx_v_n_out = __pyx_int_0;\n",
       "
 12:     # n_out is not defined as an `int` anymore and is now thus a regular Python object
\n", "
 13:     # ===================================
\n", "
+14:     for rectangle in rectangles[:n_rectangles]:
\n", "
  __pyx_t_2 = (__pyx_v_rectangles + __pyx_v_n_rectangles);\n",
       "  for (__pyx_t_3 = __pyx_v_rectangles; __pyx_t_3 < __pyx_t_2; __pyx_t_3++) {\n",
       "    __pyx_t_1 = __pyx_t_3;\n",
       "    __pyx_v_rectangle = (__pyx_t_1[0]);\n",
       "
+15:         if rectangle.w * rectangle.h > threshold:
\n", "
    __pyx_t_4 = (((__pyx_v_rectangle.w * __pyx_v_rectangle.h) > __pyx_v_threshold) != 0);\n",
       "    if (__pyx_t_4) {\n",
       "/* … */\n",
       "    }\n",
       "  }\n",
       "
+16:             n_out += 1
\n", "
      __pyx_t_5 = __Pyx_PyInt_AddObjC(__pyx_v_n_out, __pyx_int_1, 1, 1); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 16, __pyx_L1_error)\n",
       "      __Pyx_GOTREF(__pyx_t_5);\n",
       "      __Pyx_DECREF_SET(__pyx_v_n_out, __pyx_t_5);\n",
       "      __pyx_t_5 = 0;\n",
       "
+17:     return n_out
\n", "
  __pyx_t_6 = __Pyx_PyInt_As_int(__pyx_v_n_out); if (unlikely((__pyx_t_6 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 17, __pyx_L1_error)\n",
       "  __pyx_r = __pyx_t_6;\n",
       "  goto __pyx_L0;\n",
       "
 18: 
\n", "
+19: cpdef main_rectangles_not_so_fast():
\n", "
static PyObject *__pyx_pw_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_1main_rectangles_not_so_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/\n",
       "static PyObject *__pyx_f_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_main_rectangles_not_so_fast(CYTHON_UNUSED int __pyx_skip_dispatch) {\n",
       "  int __pyx_v_n_rectangles;\n",
       "  float __pyx_v_threshold;\n",
       "  struct __pyx_obj_5cymem_5cymem_Pool *__pyx_v_mem = 0;\n",
       "  struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle *__pyx_v_rectangles;\n",
       "  int __pyx_v_i;\n",
       "  int __pyx_v_n_out;\n",
       "  PyObject *__pyx_r = NULL;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_not_so_fast\", 0);\n",
       "/* … */\n",
       "  /* function exit code */\n",
       "  __pyx_r = Py_None; __Pyx_INCREF(Py_None);\n",
       "  goto __pyx_L0;\n",
       "  __pyx_L1_error:;\n",
       "  __Pyx_XDECREF(__pyx_t_1);\n",
       "  __Pyx_XDECREF(__pyx_t_6);\n",
       "  __Pyx_XDECREF(__pyx_t_7);\n",
       "  __Pyx_AddTraceback(\"_cython_magic_dbc2c06a712520185e24b7d477e83d8b.main_rectangles_not_so_fast\", __pyx_clineno, __pyx_lineno, __pyx_filename);\n",
       "  __pyx_r = 0;\n",
       "  __pyx_L0:;\n",
       "  __Pyx_XDECREF((PyObject *)__pyx_v_mem);\n",
       "  __Pyx_XGIVEREF(__pyx_r);\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "\n",
       "/* Python wrapper */\n",
       "static PyObject *__pyx_pw_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_1main_rectangles_not_so_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/\n",
       "static PyObject *__pyx_pw_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_1main_rectangles_not_so_fast(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused) {\n",
       "  PyObject *__pyx_r = 0;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_not_so_fast (wrapper)\", 0);\n",
       "  __pyx_r = __pyx_pf_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_main_rectangles_not_so_fast(__pyx_self);\n",
       "\n",
       "  /* function exit code */\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "\n",
       "static PyObject *__pyx_pf_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_main_rectangles_not_so_fast(CYTHON_UNUSED PyObject *__pyx_self) {\n",
       "  PyObject *__pyx_r = NULL;\n",
       "  __Pyx_RefNannyDeclarations\n",
       "  __Pyx_RefNannySetupContext(\"main_rectangles_not_so_fast\", 0);\n",
       "  __Pyx_XDECREF(__pyx_r);\n",
       "  __pyx_t_1 = __pyx_f_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_main_rectangles_not_so_fast(0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_r = __pyx_t_1;\n",
       "  __pyx_t_1 = 0;\n",
       "  goto __pyx_L0;\n",
       "\n",
       "  /* function exit code */\n",
       "  __pyx_L1_error:;\n",
       "  __Pyx_XDECREF(__pyx_t_1);\n",
       "  __Pyx_AddTraceback(\"_cython_magic_dbc2c06a712520185e24b7d477e83d8b.main_rectangles_not_so_fast\", __pyx_clineno, __pyx_lineno, __pyx_filename);\n",
       "  __pyx_r = NULL;\n",
       "  __pyx_L0:;\n",
       "  __Pyx_XGIVEREF(__pyx_r);\n",
       "  __Pyx_RefNannyFinishContext();\n",
       "  return __pyx_r;\n",
       "}\n",
       "
+20:     cdef int n_rectangles = 10000000
\n", "
  __pyx_v_n_rectangles = 0x989680;\n",
       "
+21:     cdef float threshold = 0.25
\n", "
  __pyx_v_threshold = 0.25;\n",
       "
+22:     cdef Pool mem = Pool()
\n", "
  __pyx_t_1 = __Pyx_PyObject_CallNoArg(((PyObject *)__pyx_ptype_5cymem_5cymem_Pool)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 22, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_v_mem = ((struct __pyx_obj_5cymem_5cymem_Pool *)__pyx_t_1);\n",
       "  __pyx_t_1 = 0;\n",
       "
+23:     cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
\n", "
  __pyx_t_2 = ((struct __pyx_vtabstruct_5cymem_5cymem_Pool *)__pyx_v_mem->__pyx_vtab)->alloc(__pyx_v_mem, __pyx_v_n_rectangles, (sizeof(struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle))); if (unlikely(__pyx_t_2 == ((void *)NULL))) __PYX_ERR(0, 23, __pyx_L1_error)\n",
       "  __pyx_v_rectangles = ((struct __pyx_t_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_Rectangle *)__pyx_t_2);\n",
       "
+24:     for i in range(n_rectangles):
\n", "
  __pyx_t_3 = __pyx_v_n_rectangles;\n",
       "  __pyx_t_4 = __pyx_t_3;\n",
       "  for (__pyx_t_5 = 0; __pyx_t_5 < __pyx_t_4; __pyx_t_5+=1) {\n",
       "    __pyx_v_i = __pyx_t_5;\n",
       "
+25:         rectangles[i].w = random()
\n", "
    __pyx_t_6 = __Pyx_GetModuleGlobalName(__pyx_n_s_random); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "    __Pyx_GOTREF(__pyx_t_6);\n",
       "    __pyx_t_7 = NULL;\n",
       "    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_6))) {\n",
       "      __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_6);\n",
       "      if (likely(__pyx_t_7)) {\n",
       "        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_6);\n",
       "        __Pyx_INCREF(__pyx_t_7);\n",
       "        __Pyx_INCREF(function);\n",
       "        __Pyx_DECREF_SET(__pyx_t_6, function);\n",
       "      }\n",
       "    }\n",
       "    if (__pyx_t_7) {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "      __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;\n",
       "    } else {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallNoArg(__pyx_t_6); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "    }\n",
       "    __Pyx_GOTREF(__pyx_t_1);\n",
       "    __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "    __pyx_t_8 = __pyx_PyFloat_AsFloat(__pyx_t_1); if (unlikely((__pyx_t_8 == (float)-1) && PyErr_Occurred())) __PYX_ERR(0, 25, __pyx_L1_error)\n",
       "    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "    (__pyx_v_rectangles[__pyx_v_i]).w = __pyx_t_8;\n",
       "
+26:         rectangles[i].h = random()
\n", "
    __pyx_t_6 = __Pyx_GetModuleGlobalName(__pyx_n_s_random); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 26, __pyx_L1_error)\n",
       "    __Pyx_GOTREF(__pyx_t_6);\n",
       "    __pyx_t_7 = NULL;\n",
       "    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_6))) {\n",
       "      __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_6);\n",
       "      if (likely(__pyx_t_7)) {\n",
       "        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_6);\n",
       "        __Pyx_INCREF(__pyx_t_7);\n",
       "        __Pyx_INCREF(function);\n",
       "        __Pyx_DECREF_SET(__pyx_t_6, function);\n",
       "      }\n",
       "    }\n",
       "    if (__pyx_t_7) {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)\n",
       "      __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;\n",
       "    } else {\n",
       "      __pyx_t_1 = __Pyx_PyObject_CallNoArg(__pyx_t_6); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)\n",
       "    }\n",
       "    __Pyx_GOTREF(__pyx_t_1);\n",
       "    __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "    __pyx_t_8 = __pyx_PyFloat_AsFloat(__pyx_t_1); if (unlikely((__pyx_t_8 == (float)-1) && PyErr_Occurred())) __PYX_ERR(0, 26, __pyx_L1_error)\n",
       "    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "    (__pyx_v_rectangles[__pyx_v_i]).h = __pyx_t_8;\n",
       "  }\n",
       "
+27:     n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)
\n", "
  __pyx_v_n_out = __pyx_f_46_cython_magic_dbc2c06a712520185e24b7d477e83d8b_check_rectangles_cy(__pyx_v_rectangles, __pyx_v_n_rectangles, __pyx_v_threshold);\n",
       "
+28:     print(n_out)
\n", "
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_n_out); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 28, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_1);\n",
       "  __pyx_t_6 = __Pyx_PyObject_CallOneArg(__pyx_builtin_print, __pyx_t_1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 28, __pyx_L1_error)\n",
       "  __Pyx_GOTREF(__pyx_t_6);\n",
       "  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;\n",
       "  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;\n",
       "
" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%cython -a\n", "from cymem.cymem cimport Pool\n", "from random import random\n", "\n", "cdef struct Rectangle:\n", " float w\n", " float h\n", "\n", "cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):\n", " # ========== MODIFICATION ===========\n", " # We changed the following line from `cdef int n_out = 0` to\n", " n_out = 0\n", " # n_out is not defined as an `int` anymore and is now thus a regular Python object\n", " # ===================================\n", " for rectangle in rectangles[:n_rectangles]:\n", " if rectangle.w * rectangle.h > threshold:\n", " n_out += 1\n", " return n_out\n", "\n", "cpdef main_rectangles_not_so_fast():\n", " cdef int n_rectangles = 10000000\n", " cdef float threshold = 0.25\n", " cdef Pool mem = Pool()\n", " cdef Rectangle* rectangles = mem.alloc(n_rectangles, sizeof(Rectangle))\n", " for i in range(n_rectangles):\n", " rectangles[i].w = random()\n", " rectangles[i].h = random()\n", " n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)\n", " print(n_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that line 16 in the loop of `check_rectangles_cy` is highlighted, indicating that the Cython compiler had to add some Python API overhead. 
\n" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2018-06-11T10:24:00.441633Z", "start_time": "2018-06-11T10:23:59.615416Z" } }, "source": [ "# 💫 Using Cython with spaCy to speed up NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our blog post goes into some detail about how spaCy can help you speed up your NLP code with Cython.\n", "\n", "Here is a short summary of the post:\n", "- the official Cython documentation advises against the use of C strings: `Generally speaking: unless you know what you are doing, avoid using C strings where possible and use Python string objects instead.`\n", "- spaCy lets us overcome this problem by:\n", " - converting all strings to 64-bit hashes using a lookup table between Python unicode strings and 64-bit hashes called the `StringStore`\n", " - giving us access to fully populated C level structures of the document and vocabulary called `TokenC` and `LexemeC`\n", "\n", "The `StringStore` object is accessible from everywhere in spaCy and from every object (see the figure below), for example as `nlp.vocab.strings`, `doc.vocab.strings` or `span.doc.vocab.strings`:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![spaCy's internals](https://cdn-images-1.medium.com/max/600/1*nxvhI7mEc9A75PwMH-PSBg.png \"spaCy's internals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a simple example of NLP processing in Cython.\n", "\n", "First let's build a list of big documents and parse them using spaCy (this takes a few minutes):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:32:03.144272Z", "start_time": "2018-06-12T00:26:17.250869Z" } }, "outputs": [], "source": [ "import urllib.request\n", "import spacy\n", "# Build a dataset of 10 parsed documents extracted from the Wikitext-2 dataset\n", "with 
urllib.request.urlopen('https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/valid.txt') as response:\n", " text = response.read()\n", "nlp = spacy.load('en')\n", "doc_list = list(nlp(text[:800000].decode('utf8')) for i in range(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have about 1.7 million tokens (\"words\") in our dataset:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:33:00.268705Z", "start_time": "2018-06-12T00:33:00.133740Z" } }, "outputs": [ { "data": { "text/plain": [ "1716200" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(len(doc) for doc in doc_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to perform some NLP task on this dataset.\n", "\n", "For example, we would like to count the number of times the word \"run\" is used as a noun in the dataset (i.e. tagged with a \"NN\" Part-Of-Speech tag).\n", "\n", "A Python loop to do that is short and straightforward:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:33:04.382241Z", "start_time": "2018-06-12T00:33:04.362668Z" } }, "outputs": [], "source": [ "def slow_loop(doc_list, word, tag):\n", " n_out = 0\n", " for doc in doc_list:\n", " for tok in doc:\n", " if tok.lower_ == word and tok.tag_ == tag:\n", " n_out += 1\n", " return n_out\n", "\n", "def main_nlp_slow(doc_list):\n", " n_out = slow_loop(doc_list, 'run', 'NN')\n", " print(n_out)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:33:22.274366Z", "start_time": "2018-06-12T00:33:20.842882Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "90\n", "CPU times: user 1.3 s, sys: 60.2 ms, total: 1.36 s\n", "Wall time: 1.41 s\n" ] } ], "source": [ "%%time\n", "# But it's also quite slow\n", 
"main_nlp_slow(doc_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On my laptop this code takes about 1.4 seconds to get the answer.\n", "\n", "Let's try to speed this up with spaCy and a bit of Cython.\n", "\n", "First, we have to think about the data structure. We will need a C level array for the dataset, with pointers to each document's TokenC array. We'll also need to convert the strings we use for testing to 64-bit hashes: \"run\" and \"NN\". When all the data required for our processing is in C level objects, we can then iterate at full C speed over the dataset.\n", "\n", "Here is how this example can be written in Cython with spaCy:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:33:25.001856Z", "start_time": "2018-06-12T00:33:24.970364Z" } }, "outputs": [], "source": [ "%%cython -+\n", "import numpy # Workaround: compilation sometimes fails with a numpy import error unless numpy is imported first\n", "from cymem.cymem cimport Pool\n", "from spacy.tokens.doc cimport Doc\n", "from spacy.typedefs cimport hash_t\n", "from spacy.structs cimport TokenC\n", "\n", "cdef struct DocElement:\n", " TokenC* c\n", " int length\n", "\n", "cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):\n", " cdef int n_out = 0\n", " for doc in docs[:n_docs]:\n", " for c in doc.c[:doc.length]:\n", " if c.lex.lower == word and c.tag == tag:\n", " n_out += 1\n", " return n_out\n", "\n", "cpdef main_nlp_fast(doc_list):\n", " cdef int i, n_out, n_docs = len(doc_list)\n", " cdef Pool mem = Pool()\n", " cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))\n", " cdef Doc doc\n", " for i, doc in enumerate(doc_list): # Populate our database structure\n", " docs[i].c = doc.c\n", " docs[i].length = (<Doc>doc).length\n", " word_hash = doc.vocab.strings.add('run')\n", " tag_hash = doc.vocab.strings.add('NN')\n", " n_out = fast_loop(docs, n_docs, word_hash, tag_hash)\n", " print(n_out)" ] }, { "cell_type": 
"code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-06-12T00:34:51.453322Z", "start_time": "2018-06-12T00:34:51.415666Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "90\n", "CPU times: user 20.6 ms, sys: 405 µs, total: 21 ms\n", "Wall time: 21 ms\n" ] } ], "source": [ "%%time\n", "main_nlp_fast(doc_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code is a bit longer because we have to declare and populate the C structures in `main_nlp_fast` before calling our Cython function.\n", "\n", "But it is also a lot faster! In my Jupyter notebook, this Cython code takes about 21 milliseconds to run on my laptop, which is about **60 times faster** than our previous pure Python loop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The absolute speed is also impressive for a module written in an interactive Jupyter Notebook and which can interface natively with other Python modules and functions: scanning ~1.7 million words in 21 ms means we are processing **a whopping 80 million words per second**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python (convai)", "language": "python", "name": "convai" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }