{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Anacycliques\n", "\n", "Trouver tous les anacycliques dans une liste de mots (issue de lexique.org par exemple). Les palindromes sont des cas particuliers d'anacycliques. \n", "Ex : 'été' est un palidrome \n", "'vu' et 'uv', 'tort' et 'trot' sont des anacycliques\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tester si deux mots sont des anacycliques est très simple en Python. Pour cela on peut utiliser les slices avec le troisième argument pour le pas. \n", "`word[::-1]` : un pas de `-1` permet d'inverser la chaîne de caractères." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def is_anacyclique(word1, word2):\n", " \"\"\"\n", " Returns True if the given words are palindromes\n", " False if not\n", " \"\"\"\n", " if word1 == word2[::-1]:\n", " return True\n", " else:\n", " return False" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "is_anacyclique(\"tort\", \"trot\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nb de mots : 142641\n" ] } ], "source": [ "words = []\n", "with open('lexique381.ortho', 'r') as f:\n", " for line in f:\n", " line = line.rstrip()\n", " if len(line) != 1:\n", " words.append(line)\n", "print(\"Nb de mots : {}\".format(len(words)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tester l'appartenance d'un élément à une liste de 142641 éléments est couteux en temps de traitement. \n", "Le test d'appartenance d'un élément à une liste est de complexité 0(n) (voir [TimeComplexity](https://wiki.python.org/moin/TimeComplexity) sur le wiki Python). \n", "Ici comme la liste est déjà pas mal grande et que l'opération est répétée n fois, le temps de traitement est rhédibitoire." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Puisqu'on aime se compliquer la vie on peut essayer de réduite la taille de la liste en se limitant aux mots de même taille. Les anacycliques sont forcément de même taille." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def words_by_length(words):\n", " \"\"\"\n", " Receives a list of words sorted by length, returns a list of words for each length\n", " \"\"\"\n", " for length in range(len(words[0]), len(words[-1]) + 1):\n", " yield [word for word in words if len(word) == length]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res = set()\n", "sorted_words = sorted(words, key = lambda x:len(x))\n", "for words_n in words_by_length(sorted_words):\n", " for word in words_n:\n", " acyclique = word[::-1]\n", " if acyclique in words_n:\n", " res.add(tuple([word, acyclique]))\n", " \n", "for match in sorted(res):\n", " print(match)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Malgré le découpage du lexique en mots de même taille et l'utilisation des générateurs, le temps de traitement reste important. \n", "Essayons avec un dictionnaire." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nb de mots : 125627\n" ] } ], "source": [ "from collections import defaultdict\n", "words_d = defaultdict()\n", "with open('lexique381.ortho', 'r') as f:\n", " for line in f:\n", " line = line.rstrip()\n", " if len(line) != 1:\n", " words_d[line] = \"\"\n", "print(\"Nb de mots : {}\".format(len(words_d.keys())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On a une différence de taille d'avec la liste parce que le dictionnaire supprime les doublons (125627 contre 142641). Mais ça ne change pas vraiment l'ordre du volume des données." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('aa', 'aa')\n", "('ada', 'ada')\n", "('ados', 'soda')\n", "('adulé', 'éluda')\n", "('aga', 'aga')\n", "('agas', 'saga')\n", "('ah', 'ha')\n", "('ail', 'lia')\n", "('ailé', 'élia')\n", "('air', 'ria')\n", "('ako', 'oka')\n", "('alla', 'alla')\n", "('alloc', 'colla')\n", "('amis', 'sima')\n", "('an', 'na')\n", "('ana', 'ana')\n", "('anas', 'sana')\n", "('angor', 'rogna')\n", "('annoté', 'étonna')\n", "('ara', 'ara')\n" ] } ], "source": [ "res = set()\n", "for word in words_d.keys():\n", " acyclique = word[::-1]\n", " if acyclique in words:\n", " res.add(tuple([word, acyclique]))\n", " \n", "for match in sorted(res)[:20]:\n", " print(match)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Là c'est tout de suite BEAUCOUP plus rapide (supprimez `[:20]` pour avoir tous les résultats). Mais pourquoi ? \n", "Parce que le test d'appartenance pour un dictionnaire est estimé à O(1) en moyenne (O(n) dans le pire des cas). \n", "idem pour les ensembles (`set`)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nb de mots : 125627\n" ] } ], "source": [ "words_s = set()\n", "with open('lexique381.ortho', 'r') as f:\n", " for line in f:\n", " line = line.rstrip()\n", " if len(line) != 1:\n", " words_s.add(line)\n", "print(\"Nb de mots : {}\".format(len(words_s)))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('aa', 'aa')\n", "('ada', 'ada')\n", "('ados', 'soda')\n", "('adulé', 'éluda')\n", "('aga', 'aga')\n", "('agas', 'saga')\n", "('ah', 'ha')\n", "('ail', 'lia')\n", "('ailé', 'élia')\n", "('air', 'ria')\n", "('ako', 'oka')\n", "('alla', 'alla')\n", "('alloc', 'colla')\n", "('amis', 'sima')\n", "('an', 'na')\n", "('ana', 'ana')\n", "('anas', 'sana')\n", "('angor', 'rogna')\n", "('annoté', 'étonna')\n", "('ara', 'ara')\n" ] } ], "source": [ "res = set()\n", "for word in words_s:\n", " acyclique = word[::-1]\n", " if acyclique in words:\n", " res.add(tuple([word, acyclique]))\n", " \n", "for match in sorted(res)[:20]:\n", " print(match)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Allez on va s'amuser à mesurer ça :" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n", "Wall time: 3.02 ms\n", "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", "Wall time: 7.39 µs\n", "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", "Wall time: 8.82 µs\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time \"palindrome\" in words\n", "%time \"palindrome\" in words_d\n", "%time \"palindrome\" in words_s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Si vous êtes intéressés par l'implémentation des dictionnaires vous pourrez lire l'article suivant : [Python dictionary implementation] (http://www.laurentluce.com/posts/python-dictionary-implementation/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2+" } }, "nbformat": 4, "nbformat_minor": 0 }