{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this analysis, the Gunning Fog Index was used to measure the readibility score of EDMW and Reddit comments. \n",
    "\n",
    "The Gunning Fog formula generates a grade level, typically between 0 and 20. The formula estimates the years of formal education the reader requires to understand the text on first reading.\n",
    "\n",
    "So, if a piece of text has a grade level readability score of 6 then this should be easily readable by those educated to 6th grade in the US schooling system, i.e. 11-12 year olds.\n",
    "\n",
    "Text to be read by the general public should aim for a grade level of around 8. Text above a score of 17 should be taken to have graduate level readability.\n",
    "\n",
    "The formula for Gunning Fog is 0.4 [(words/sentences) + 100 (complex words/words)], where complex words are defined as those containing three or more syllables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The textatistic module was used to derive the Gunning Fog Index\n",
    "\n",
    "http://www.erinhengel.com/software/textatistic/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from textatistic import Textatistic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "edmw = pd.read_csv('/Users/nus/Desktop/Project_Forum/edmw_data_clean.csv')\n",
    "reddit = pd.read_csv('/Users/nus/Desktop/Project_Forum/reddit_data_clean.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_comment = reddit['Comment']\n",
    "edmw_comment = edmw['Comment']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Make sure there is a punctuation at the end of every comment\n",
    "def fullstop(x):  \n",
    "    if str(x)[-1] == '.' or str(x)[-1] == '?' or str(x)[-1] == '!':\n",
    "        return str(x)\n",
    "    else:\n",
    "        return str(x) + '.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_comment = reddit_comment.apply(fullstop)\n",
    "edmw_comment = edmw_comment.apply(fullstop)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Take away multiple punctuation (eg '???','...') as they will be perceived as multiple sentences\n",
    "reddit_comment = reddit_comment.str.replace(r'[.?!]{2,}','.',regex = True)\n",
    "edmw_comment = edmw_comment.str.replace(r'[.?!]{2,}','.',regex = True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_comment = reddit_comment.to_numpy()\n",
    "edmw_comment = edmw_comment.to_numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "204955"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Combine comments to form texts with >100 words\n",
    "reddit_text = ''\n",
    "reddit_combined_comments = []\n",
    "for x in reddit_comment:\n",
    "    if len(re.findall(r'\\w+', reddit_text))<100:\n",
    "        reddit_text = reddit_text + str(x)\n",
    "    else:\n",
    "        reddit_combined_comments.append(reddit_text)\n",
    "        reddit_text = ''\n",
    "        \n",
    "len(reddit_combined_comments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "288474"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "edmw_text = ''\n",
    "edmw_combined_comments = []\n",
    "for x in edmw_comment:\n",
    "    if len(re.findall(r'\\w+', edmw_text))<100:\n",
    "        edmw_text = edmw_text + str(x)\n",
    "    else:\n",
    "        edmw_combined_comments.append(edmw_text)\n",
    "        edmw_text = ''\n",
    "        \n",
    "len(edmw_combined_comments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_df = pd.DataFrame({'Combined Comments':reddit_combined_comments})\n",
    "edmw_df = pd.DataFrame({'Combined Comments':edmw_combined_comments})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "204955\r"
     ]
    }
   ],
   "source": [
    "# Find the Gunning Fog Index score\n",
    "reddit_gunningfog=[]\n",
    "count = 0\n",
    "for x in reddit_combined_comments:\n",
    "    count+=1\n",
    "    print(count, end='\\r')\n",
    "    try:\n",
    "        score = Textatistic(str(x)).gunningfog_score\n",
    "        reddit_gunningfog.append(score)\n",
    "    except:\n",
    "        reddit_gunningfog.append('error')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "288474\r"
     ]
    }
   ],
   "source": [
    "edmw_gunningfog=[]\n",
    "count = 0\n",
    "for x in edmw_combined_comments:\n",
    "    count+=1\n",
    "    print(count, end='\\r')\n",
    "    try:\n",
    "        score = Textatistic(str(x)).gunningfog_score\n",
    "        edmw_gunningfog.append(score)\n",
    "    except:\n",
    "        edmw_gunningfog.append('error')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_df['Gunning Fog Index'] = reddit_gunningfog\n",
    "edmw_df['Gunning Fog Index'] = edmw_gunningfog"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "reddit_df.drop(reddit_df[reddit_df['Gunning Fog Index']=='error'].index,inplace = True)\n",
    "edmw_df.drop(edmw_df[edmw_df['Gunning Fog Index']=='error'].index,inplace = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "edmw_df.to_csv('~/Desktop/Project_Forum/edmw_readibility.csv')\n",
    "reddit_df.to_csv('~/Desktop/Project_Forum/reddit_readibility.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.605431178459541"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(reddit_df['Gunning Fog Index'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.250669099489326"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(edmw_df['Gunning Fog Index'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "   ## Limitations:\n",
    "\n",
    "1. This metric works best when used on pure English text but comments from EDMW and Reddit forums will definitely contain occasional Chinese characters (especially so for EDMW), emoticons, text emojis :) and other unconventionally structured content ¯\\_(ツ)_/¯. Effort has been made to clean the comments as much as possible to reflect an accurate score but there will still be remnants of weird comments that affect the score. \n",
    "\n",
    "\n",
    "2. Due to the unstructured and informal nature of comments, the score of readibility metrics (Flesch, Flesch-Kincaid, Dale-Chall etc) vary quite substantially among different Python readibility modules (Textatistic, Readibility etc) and readibility calculator websites, depending on how they are coded. This could be due to how different methods determine the number of sentences or number of syllables in a text. The presence of emoticons and symbols in comments would further complicate this. After testing out several modules and online calculators, the Gunning Fog Index seemed to be relatively the most consistent of the metrics and perhaps more suitable for this context. \n",
    "\n",
    "\n",
    "3. Though considered as an accurate Readability Formula, The Gunning Fog Index Formula has some flaws. For example, it discounts that not all multi-syllabic words are difficult. \n",
    "\n",
    "\n",
    "4. The Gunning Fog Index considers a text to be readible if it has short sentences with little or no multi-syllable words. Hence Reddit has a higher mean readibility score (less readible) than EDMW due to longer, complete sentences and more sophisticated English vocab in Reddit comments. However, this might not be true in reality. The much more abundant usage of internal lingo and multiple languages/dialects in EDMW serves as quite a large barrier of entry to some. An unpractised user of EDMW might not find EDMW comments to be 'readible' at all even though comments are generally shorter than Reddit. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}