{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Linking MFDS (Ministry of Food and Drug Safety) Recipe Data to Menu Names**\n",
    "# **1 MAFRA Agri-Food Recipe Collector**\n",
    "## **01 MAFRA Agri-Food API and CSV Preprocessing**\n",
    "**[Recipe ingredient information API](http://data.mafra.go.kr/opendata/data/indexOpenDataDetail.do?data_id=20150827000000000465&filter_ty=O&getBack=G&sort_id=&s_data_nm=&instt_id=&cl_code=&shareYn=)**\n",
    "1. **[Sorting the file list returned by the glob module](https://redcarrot.tistory.com/222)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>레시피 이름</th>\n",
       "      <th>레시피 코드</th>\n",
       "      <th>상세 레시피</th>\n",
       "      <th>음식분류코드</th>\n",
       "      <th>음식분류</th>\n",
       "      <th>조리시간</th>\n",
       "      <th>분량</th>\n",
       "      <th>대표이미지 URL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>콩비지동그랑땡</td>\n",
       "      <td>195453</td>\n",
       "      <td>{'계란': '5개', '전분': '1/2T', '부침가루': '1/2T', '소금...</td>\n",
       "      <td>3010018</td>\n",
       "      <td>부침</td>\n",
       "      <td>30분</td>\n",
       "      <td>3인분</td>\n",
       "      <td>http://file.okdab.com/recipe/14829957726840013...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>누드김밥</td>\n",
       "      <td>195428</td>\n",
       "      <td>{'통깨': '약간', '마요네즈': '1T', '설탕': '0.5T', '식초':...</td>\n",
       "      <td>3010001</td>\n",
       "      <td>밥</td>\n",
       "      <td>20분</td>\n",
       "      <td>3인분</td>\n",
       "      <td>http://file.okdab.com/recipe/14829933250580012...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>쪽파 새우강회</td>\n",
       "      <td>180363</td>\n",
       "      <td>{'설탕': '1+1/2T', '통깨': '약간', '식초': '2T', '고추장'...</td>\n",
       "      <td>3010007</td>\n",
       "      <td>나물/생채/샐러드</td>\n",
       "      <td>20분</td>\n",
       "      <td>2인분</td>\n",
       "      <td>http://file.okdab.com/recipe/14829900265530011...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    레시피 이름  레시피 코드                                             상세 레시피  \\\n",
       "0  콩비지동그랑땡  195453  {'계란': '5개', '전분': '1/2T', '부침가루': '1/2T', '소금...   \n",
       "1     누드김밥  195428  {'통깨': '약간', '마요네즈': '1T', '설탕': '0.5T', '식초':...   \n",
       "2  쪽파 새우강회  180363  {'설탕': '1+1/2T', '통깨': '약간', '식초': '2T', '고추장'...   \n",
       "\n",
       "    음식분류코드       음식분류 조리시간   분량  \\\n",
       "0  3010018         부침  30분  3인분   \n",
       "1  3010001          밥  20분  3인분   \n",
       "2  3010007  나물/생채/샐러드  20분  2인분   \n",
       "\n",
       "                                           대표이미지 URL  \n",
       "0  http://file.okdab.com/recipe/14829957726840013...  \n",
       "1  http://file.okdab.com/recipe/14829933250580012...  \n",
       "2  http://file.okdab.com/recipe/14829900265530011...  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "# Load the MAFRA recipes; '상세 레시피' stores each ingredient dict as a JSON string\n",
    "df_marf = pd.read_csv(\"data/food_recipe_marf.csv\")\n",
    "df_marf['상세 레시피'] = [json.loads(_)  for _ in df_marf['상세 레시피']]\n",
    "df_marf.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Menuzen Recipe Data**\n",
    "**[Recipe ingredient information API](http://data.mafra.go.kr/opendata/data/indexOpenDataDetail.do?data_id=20150827000000000465&filter_ty=O&getBack=G&sort_id=&s_data_nm=&instt_id=&cl_code=&shareYn=)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>menu</th>\n",
       "      <th>recipe</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>멕시칸샐러드</td>\n",
       "      <td>{'스모크햄': 10.0, '양배추': 20.0, '당근': 10.0, '오이': ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>호박전</td>\n",
       "      <td>{'쥬키니': 60.0, '계란': 20.0}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>양파간장지</td>\n",
       "      <td>{'양파': 30.0, '청량': 50.0, '얼갈이': 55.0, '양파 ': 4...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     menu                                             recipe\n",
       "0  멕시칸샐러드  {'스모크햄': 10.0, '양배추': 20.0, '당근': 10.0, '오이': ...\n",
       "1     호박전                          {'쥬키니': 60.0, '계란': 20.0}\n",
       "2   양파간장지  {'양파': 30.0, '청량': 50.0, '얼갈이': 55.0, '양파 ': 4..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "# utf-8-sig drops a leading byte-order mark if the file has one\n",
    "with open(\"data/food_recipie.json\", \"r\", encoding='utf-8-sig') as f:\n",
    "    recipe_json = json.load(f)\n",
    "df_menuzen = pd.DataFrame(list(recipe_json.items()), columns=['menu', 'recipe'])\n",
    "df_menuzen.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **03 10,000 Recipes**\n",
    "Find and organize a somewhat more standardized dataset. Reference for parsing the list column: **[List String -> List Data](https://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(71998, 3)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>menu</th>\n",
       "      <th>recipes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>/recipe/6923603</td>\n",
       "      <td>임성근의 김치콩나물죽, 두부두루치기 알토란 261회</td>\n",
       "      <td>[[김치콩나물죽 재료], 신김치|1/4포기,콩나물|2줌,바지락살|1+1/2컵,참기름...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>/recipe/6923602</td>\n",
       "      <td>또띠아 사과 샌드위치</td>\n",
       "      <td>[[재료], 또띠아|2장,사과|1개,피자치즈|1컵,양파|1/2개,소시지|약간,시금치...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               url                          menu  \\\n",
       "0  /recipe/6923603  임성근의 김치콩나물죽, 두부두루치기 알토란 261회   \n",
       "1  /recipe/6923602                   또띠아 사과 샌드위치   \n",
       "\n",
       "                                             recipes  \n",
       "0  [[김치콩나물죽 재료], 신김치|1/4포기,콩나물|2줌,바지락살|1+1/2컵,참기름...  \n",
       "1  [[재료], 또띠아|2장,사과|1개,피자치즈|1컵,양파|1/2개,소시지|약간,시금치...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json, ast\n",
    "import pandas as pd\n",
    "df_menu_10000 = pd.read_csv(\"data/food_recipe_10000.csv\", sep=\";\", header=None)\n",
    "df_menu_10000.columns = ['url', 'menu', 'tags', 'recipes']\n",
    "df_menu_10000.recipes = [ast.literal_eval(_)  for _ in df_menu_10000.recipes]\n",
    "df_menu_10000 = df_menu_10000.reindex(columns=['url','menu','recipes'])\n",
    "print(df_menu_10000.shape)\n",
    "df_menu_10000.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(2, 34734),\n",
       " (4, 31132),\n",
       " (6, 4865),\n",
       " (8, 1030),\n",
       " (10, 159),\n",
       " (12, 52),\n",
       " (14, 20),\n",
       " (16, 4),\n",
       " (26, 1),\n",
       " (32, 1)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "count_items = dict(Counter([len(_) for _ in df_menu_10000.recipes]))\n",
    "sorted(count_items.items(), key=lambda x:x[0], reverse = False)"
   ]
  },
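  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every length counted above is even, which suggests that each `recipes` list alternates a bracketed group heading with one comma-separated string of `name|amount` items. A minimal sketch of turning such a list into nested dicts, assuming that layout holds for every row and that headings are plain strings such as `'[재료]'` (`parse_recipe_groups` is a hypothetical helper, and `sample` is abridged from the first row shown earlier):\n",
    "\n",
    "```python\n",
    "def parse_recipe_groups(recipes):\n",
    "    \"\"\"Pair each heading with its 'name|amount' items (assumed layout).\"\"\"\n",
    "    groups = {}\n",
    "    # Walk the list two entries at a time: a heading, then its item string\n",
    "    for heading, items in zip(recipes[0::2], recipes[1::2]):\n",
    "        parsed = {}\n",
    "        for item in items.split(','):\n",
    "            name, _, amount = item.partition('|')\n",
    "            parsed[name.strip()] = amount.strip()\n",
    "        groups[heading.strip('[] ')] = parsed\n",
    "    return groups\n",
    "\n",
    "sample = ['[김치콩나물죽 재료]', '신김치|1/4포기,콩나물|2줌,바지락살|1+1/2컵']\n",
    "parsed_sample = parse_recipe_groups(sample)\n",
    "parsed_sample\n",
    "```\n",
    "\n",
    "Rows that break the assumed pairing would need a separate check before this is applied to the whole column."
   ]
  },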
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **04 MFDS Recipe Data**\n",
    "**[Recipe ingredient information API](http://data.mafra.go.kr/opendata/data/indexOpenDataDetail.do?data_id=20150827000000000465&filter_ty=O&getBack=G&sort_id=&s_data_nm=&instt_id=&cl_code=&shareYn=)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>메뉴명</th>\n",
       "      <th>조리방법</th>\n",
       "      <th>요리종류</th>\n",
       "      <th>해쉬태그</th>\n",
       "      <th>이미지경로(소)</th>\n",
       "      <th>이미지경로(대)</th>\n",
       "      <th>재료정보</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>칼륨 듬뿍 고구마죽</td>\n",
       "      <td>끓이기</td>\n",
       "      <td>후식</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://www.foodsafetykorea.go.kr/uploadimg/coo...</td>\n",
       "      <td>http://www.foodsafetykorea.go.kr/uploadimg/coo...</td>\n",
       "      <td>[고구마죽, 고구마 100g(2/3개), 설탕 2g(1/3작은술), 찹쌀가루 3g(...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>누룽지 두부 계란죽</td>\n",
       "      <td>끓이기</td>\n",
       "      <td>밥</td>\n",
       "      <td>순두부</td>\n",
       "      <td>http://www.foodsafetykorea.go.kr/uploadimg/coo...</td>\n",
       "      <td>http://www.foodsafetykorea.go.kr/uploadimg/coo...</td>\n",
       "      <td>[채소준비, 애호박 30g(1/6개), 표고버섯 20g(2개), 당근 5g(3×2×...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          메뉴명 조리방법 요리종류 해쉬태그  \\\n",
       "0  칼륨 듬뿍 고구마죽  끓이기   후식  NaN   \n",
       "1  누룽지 두부 계란죽  끓이기    밥  순두부   \n",
       "\n",
       "                                            이미지경로(소)  \\\n",
       "0  http://www.foodsafetykorea.go.kr/uploadimg/coo...   \n",
       "1  http://www.foodsafetykorea.go.kr/uploadimg/coo...   \n",
       "\n",
       "                                            이미지경로(대)  \\\n",
       "0  http://www.foodsafetykorea.go.kr/uploadimg/coo...   \n",
       "1  http://www.foodsafetykorea.go.kr/uploadimg/coo...   \n",
       "\n",
       "                                                재료정보  \n",
       "0  [고구마죽, 고구마 100g(2/3개), 설탕 2g(1/3작은술), 찹쌀가루 3g(...  \n",
       "1  [채소준비, 애호박 30g(1/6개), 표고버섯 20g(2개), 당근 5g(3×2×...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "df_safe   = pd.read_csv(\"data/food_recipe_safe.csv\")\n",
    "col_temps = ['메뉴명','조리방법','요리종류','해쉬태그','이미지경로(소)','이미지경로(대)','재료정보']\n",
    "df_safe   = df_safe.loc[:, col_temps]\n",
    "df_safe.재료정보 = [_.split('\\n')  for _ in df_safe.재료정보.fillna('')]\n",
    "df_safe.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
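    "# **2 Building Standards for Ingredient Names and Units**\n",
    "Preprocess the recipe data and build a dictionary that maps equivalent unit expressions to one standard amount.\n",
    "```\n",
    "Measurement guide\n",
    "1큰술 (tablespoon; 1T, 1Ts) = 1 spoonful = 15 ml = 3t (one scoop with a rice spoon is 1큰술)\n",
    "1작은술 (teaspoon; 1t, 1ts) = 5 ml (two scoops with a tea spoon make 1작은술)\n",
    "1컵 (cup; 1Cup, 1C) = 200 ml in Korea, China, and Japan; in the West (US) 1C is 240-250 ml, about 16T\n",
    "1종이컵 (paper cup) = 180 ml; 1 oz = 28.3 g\n",
    "1파운드 (pound, lb) = about 0.453 kg\n",
    "1갤런 (gallon) = about 3.78 L\n",
    "1꼬집 (pinch) = about 2 g; also written as '약간' (a little)\n",
    "조금 (some) = 2-3 times '약간'\n",
    "적당량 (to taste) = adjust freely to your own preference\n",
    "1줌 (handful) = one full hand (e.g. 1줌 of soup anchovies is 12-15 fish, 1줌 of greens is 50 g); a large 1줌 = 2줌\n",
    "1주먹 (fist) = the size of an adult woman's fist; 100 g for meat\n",
    "1토막 (piece) = a portion about 2-3 cm thick\n",
    "마늘 1톨 (garlic) = one peeled clove\n",
    "생강 1쪽 (ginger) = about the size of one garlic clove (마늘 1톨)\n",
    "생강 1톨 (ginger) = one whole ginger root about the size of a baby's palm\n",
    "고기 1근 (meat) = 600 g\n",
    "채소 1근 (vegetables) = 400 g\n",
    "채소 1봉지 (one bag of vegetables) = about 200 g\n",
    "```"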
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "units = {\n",
    "    \"15ml\":[\"1큰술\",\"1T\",\"1Ts\",\"3t\"],\n",
    "    \"5ml\":[\"1작은술\",\"1t\",\"1ts\"],\n",
    "    \"200ml\":[\"1컵\",\"1Cup\",\"1C\"],\n",
    "    \"250ml\":[\"16T\", \"1C\"],\n",
    "    \"180ml\":[\"1종이컵\"],\n",
    "    \"28.3g\":[\"1oz\"],\n",
    "    \"453g\":[\"1파운드\",\"lb\"],\n",
    "    \"3780ml\":[\"1갤런\",\"gallon\"],\n",
    "    \"2g\":[\"1꼬집\",'약간'],\n",
    "    \"4g\":[\"조금\"],\n",
    "    \"6g\":[\"조금\"],\n",
    "    \"10g\":[\"적당량\"],\n",
    "    \"50g\":[\"1줌\"], # greens\n",
    "    # A dict literal keeps only the last value of a repeated key, so both 100g aliases share one entry\n",
    "    \"100g\":[\"1큰줌\",\"1주먹\"], # 1큰줌 of greens; 1주먹 = an adult woman's fist (meat)\n",
    "    \"13마리\":[\"1줌\"], # anchovies\n",
    "    \"26마리\":[\"1큰줌\"], # anchovies\n",
    "    \"3cm\":[\"1토막\"],\n",
    "    \"1알\":[\"1톨\",\"1쪽\"], # garlic, ginger, etc.\n",
    "    \"600g\":[\"1근\"], # meat\n",
    "    \"400g\":[\"1근\"], # vegetables\n",
    "    \"200g\":[\"1봉지\"],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'16T, 1C, 1Cup, 1T, 1Ts, 1oz, 1t, 1ts, 1갤런, 1근, 1꼬집, 1봉지, 1작은술, 1종이컵, 1주먹, 1줌, 1쪽, 1컵, 1큰술, 1큰줌, 1토막, 1톨, 1파운드, 3t, gallon, lb, 약간, 적당량, 조금'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collect every alias once and list them in sorted order\n",
    "results = []\n",
    "for aliases in units.values():\n",
    "    results += aliases\n",
    "\", \".join(sorted(set(results)))"
   ]
  },
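  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When recipes are parsed, the lookup usually runs the other way, from an expression found in the text to its standard amount. A minimal sketch of inverting the table, assuming an alias may legitimately map to more than one standard (`units_sample` and `invert_units` are hypothetical names, and `units_sample` is only an excerpt of the `units` table above so the snippet runs on its own):\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "# Small excerpt of the units table, enough to run standalone\n",
    "units_sample = {\n",
    "    '15ml': ['1큰술', '1T', '1Ts', '3t'],\n",
    "    '5ml': ['1작은술', '1t', '1ts'],\n",
    "    '600g': ['1근'],  # meat\n",
    "    '400g': ['1근'],  # vegetables\n",
    "}\n",
    "\n",
    "def invert_units(units):\n",
    "    \"\"\"Map each alias to every standard amount it can stand for.\"\"\"\n",
    "    alias_to_standard = defaultdict(list)\n",
    "    for standard, aliases in units.items():\n",
    "        for alias in aliases:\n",
    "            alias_to_standard[alias].append(standard)\n",
    "    return dict(alias_to_standard)\n",
    "\n",
    "lookup = invert_units(units_sample)\n",
    "lookup['1T'], lookup['1근']\n",
    "```\n",
    "\n",
    "An ambiguous alias such as `1근` keeps both candidates, so the caller still has to choose one from the ingredient type."
   ]
  },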
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **01 MFDS Recipe Data**\n",
    "This dataset is large, which makes it useful for establishing the standard units."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    [고구마죽, 고구마 100g(2/3개), 설탕 2g(1/3작은술), 찹쌀가루 3g(...\n",
       "1    [채소준비, 애호박 30g(1/6개), 표고버섯 20g(2개), 당근 5g(3×2×...\n",
       "2    [초밥, 밥 210g(1공기), 배합초, 식초 20g(1⅓큰술), 설탕 10g(2작...\n",
       "3    [두부 곤약잡곡밥, 두부 110g(⅓모), 흰쌀 15g, 현미쌀 3g, 찹쌀 3g,...\n",
       "Name: 재료정보, dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "df_safe   = pd.read_csv(\"data/food_recipe_safe.csv\")\n",
    "df_safe   = df_safe.loc[:, ['메뉴명','재료정보']]\n",
    "df_safe.재료정보 = [_.split('\\n')  for _ in df_safe.재료정보.fillna('')]\n",
    "df_safe_recipe = df_safe[\"재료정보\"]\n",
    "df_safe_recipe[:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 459),\n",
       " (2, 127),\n",
       " (3, 202),\n",
       " (4, 190),\n",
       " (5, 95),\n",
       " (6, 76),\n",
       " (7, 24),\n",
       " (8, 14),\n",
       " (9, 9),\n",
       " (10, 1),\n",
       " (12, 1)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "count_items = dict(Counter(len(_)  for _ in df_safe_recipe))\n",
    "sorted(count_items.items(), key=lambda x:x[0], reverse = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Finding Intermediate Groups in the MFDS Recipe Data**\n",
    "Use a funnel approach:\n",
    "- Save the result of each stage: Step1 => Step2 => Step3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "제목 맛있는얼큰부침개 맛있는얼큰부침개\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "# A line with only Hangul words is a key (sub-heading); a line with () or other characters is a value\n",
    "# temp = \"녹두 전(10g/12)\"   # alternative sample: takes the else branch\n",
    "temp = \"맛있는 얼큰 부침개\"\n",
    "if temp.strip().replace(\" \",\"\") == \"\".join(re.findall(\"[가-힣]+\", temp)):\n",
    "    print(\"제목\", temp.strip().replace(\" \",\"\"), \"\".join(re.findall(\"[가-힣]+\", temp)))\n",
    "else:\n",
    "    print(\"레시피\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1198,\n",
       " [[997,\n",
       "   defaultdict(int,\n",
       "               {'레시피': '고등어 35g, 전분 5g, 땅콩 1g, 튀김기름 5g [고등어 밑 양념]생강즙 1g, 청주 2g, 소금적당량, 후춧가루 적당량[조림장]간장 1g, 고추장 3g, 토마토케첩 3g, 설탕 1g, 물엿 3g, 물 5g'})],\n",
       "  [998,\n",
       "   defaultdict(int,\n",
       "               {'레시피': '꽃게 70g, 청주 2g, 전분 3g, 실파 5g, 참기름 2g, 건고추 3g, 마늘다진것 3g, 생강다진것 2g, 소금적당량, 후춧가루적당량 [양념장]간장 2g, 청주 2g, 설탕 3g, 물 3g'})]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Preprocess the intermediate groups (sub-headings inside each recipe)\n",
    "from collections import defaultdict\n",
    "results = []\n",
    "for no, items in enumerate(df_safe_recipe):\n",
    "    data = defaultdict(int)  # data for one recipe\n",
    "    idx_token = None         # reset per recipe so a heading never leaks into the next one\n",
    "    for _ in items:\n",
    "        # A line made up only of Hangul words is a sub-heading\n",
    "        if _.replace(\" \",\"\").strip() == \"\".join(re.findall(\"[가-힣]+\", _)):\n",
    "            idx_token = _    # remember the heading and register it as a key\n",
    "            data[_]\n",
    "        else:\n",
    "            if idx_token: data[idx_token] = _\n",
    "            else: data['레시피'] = _\n",
    "    results.append([no, data])\n",
    "len(results), results[997:999]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 1017), (2, 98), (3, 60), (4, 21), (5, 2)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Analyze the results: how many groups each recipe was split into\n",
    "count_items = dict(Counter([len(_[1]) for _ in results]))\n",
    "sorted(count_items.items(), key=lambda x:x[0], reverse = False)"
   ]
  },
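  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The heading test and the grouping loop above can be folded into two small functions. A minimal sketch, assuming a heading is any non-empty line made up only of Hangul syllables and spaces; `is_heading` and `group_lines` are hypothetical names, and unlike the loop above this version joins several lines under one heading instead of keeping only the last one:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def is_heading(line):\n",
    "    \"\"\"True when the line holds only Hangul syllables and spaces.\"\"\"\n",
    "    compact = line.replace(' ', '').strip()\n",
    "    return bool(compact) and compact == ''.join(re.findall('[가-힣]+', line))\n",
    "\n",
    "def group_lines(lines, default_key='레시피'):\n",
    "    \"\"\"File each ingredient line under the most recent heading.\"\"\"\n",
    "    groups, current = {}, default_key\n",
    "    for line in lines:\n",
    "        if is_heading(line):\n",
    "            current = line.strip()\n",
    "        elif line.strip():\n",
    "            previous = groups.get(current)\n",
    "            groups[current] = f'{previous}, {line.strip()}' if previous else line.strip()\n",
    "    return groups\n",
    "\n",
    "grouped = group_lines(['초밥', '밥 210g(1공기)', '배합초', '식초 20g(1⅓큰술), 설탕 10g'])\n",
    "grouped\n",
    "```\n",
    "\n",
    "Lines that appear before any heading fall under the default key `레시피`, matching the fallback used in the loop above."
   ]
  },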
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **03 Preprocessing Recipes Collapsed into a Single String**\n",
    "Save the result of each stage: **Step1 => Step2 => Step3**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'레시피': '고등어 35g, 전분 5g, 땅콩 1g, 튀김기름 5g ',\n",
       " '고등어 밑 양념': '생강즙 1g, 청주 2g, 소금적당량, 후춧가루 적당량',\n",
       " '조림장': '간장 1g, 고추장 3g, 토마토케첩 3g, 설탕 1g, 물엿 3g, 물 5g'}"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re_key = '\\[(.+?)\\]'\n",
    "menus = results[997][1]['레시피']# .replace(']',')').replace('[','(')\n",
    "\n",
    "# Key 와 Value 값 구분하기\n",
    "import re\n",
    "tokens_values = re.split(re_key, menus)   # Token 추출\n",
    "tokens_key    = re.findall(re_key, menus) # Key 선별\n",
    "token_values  = [_ for _ in tokens_values if _ not in tokens_key] # Value 선별 (Key 만 제외)\n",
    "\n",
    "if len(tokens_key) < len(token_values):\n",
    "    if len(tokens_key)+1 == len(token_values):\n",
    "        tokens_key = ['레시피']+tokens_key\n",
    "        data = {tokens_key[no]:_ for no, _ in enumerate(token_values)}\n",
    "elif len(tokens_key) == len(token_values):\n",
    "    data = {tokens_key[no]:_ for no, _ in enumerate(token_values)}\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['레시피', '고등어 밑 양념', '조림장'],\n",
       " ['고등어 35g, 전분 5g, 땅콩 1g, 튀김기름 5g ',\n",
       "  '생강즙 1g, 청주 2g, 소금적당량, 후춧가루 적당량',\n",
       "  '간장 1g, 고추장 3g, 토마토케첩 3g, 설탕 1g, 물엿 3g, 물 5g'])"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "['레시피']+tokens_key, token_values"
   ]
  },
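  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `re.split` with one capture group returns the values and keys in a fixed alternating order, they can also be paired by position rather than by filtering out anything that equals a key. A minimal sketch under that assumption (`split_bracket_groups` is a hypothetical helper, `sample` is shortened from the recipe above, and this version strips surrounding whitespace, which the cell above does not):\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def split_bracket_groups(text, default_key='레시피'):\n",
    "    \"\"\"Split 'a [k1]b [k2]c' into {default_key: 'a', 'k1': 'b', 'k2': 'c'}.\"\"\"\n",
    "    # With one capture group, re.split yields [before, key1, value1, key2, value2, ...]\n",
    "    parts = re.split(r'\\[(.+?)\\]', text)\n",
    "    groups = {}\n",
    "    if parts[0].strip():\n",
    "        groups[default_key] = parts[0].strip()\n",
    "    for key, value in zip(parts[1::2], parts[2::2]):\n",
    "        groups[key.strip()] = value.strip()\n",
    "    return groups\n",
    "\n",
    "sample = '고등어 35g, 전분 5g [고등어 밑 양념]생강즙 1g, 청주 2g[조림장]간장 1g, 물 5g'\n",
    "split_groups = split_bracket_groups(sample)\n",
    "split_groups\n",
    "```\n",
    "\n",
    "Pairing by position keeps a value even when its text happens to match one of the keys, which the `not in tokens_key` filter would drop."
   ]
  },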
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "if len(tokens_key) == 1:\n",
    "    tokens_key = tokens_key[0]\n",
    "tokens_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if len(tokens_values)%2 == 1:\n",
    "#     tokens_values = ['레시피'] + tokens_values\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# \\w+ cannot match the spaces in '[고등어 밑 양념]', so only '[조림장]' splits the text\n",
    "tokens_values = re.split(r'\\[\\w+\\]', menus)\n",
    "len(tokens_values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for _ in tokens_values:\n",
    "    pass  # TODO: process each token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_safe.loc[993][\"재료정보\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_safe[\"재료정보\"][528]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for _ in df_safe_recipe:\n",
    "    pass  # TODO: process each recipe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[_ for _ in df_safe_recipe  if len(_) > 3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(df_safe_recipe).to_csv(\"recipe_temp.csv\", index=False, sep=\"|\", header=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_safe_recipe[410]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First, collect and organize the list of ingredients"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_safe[\"재료정보\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
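    "# **1 Exploring the momukji xlsx File**\n",
    "## **01 Inspecting the Excel Sheet Contents**\n",
    "Work here centers on **fast, structured data**.\n",
    "```python\n",
    "'<title.*?>(.+?)</title>'  # the text inside one specific tag\n",
    "\"<[^>]+>|[^<]+\"            # splits HTML into tags and the text runs between them\n",
    "'<.*?>'                    # any tag\n",
    "```"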
   ]
  },
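  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick way to see what the three patterns above do is to run them on a short string. A minimal sketch, assuming a made-up HTML snippet (`html` is illustrative only and is not data from the workbook):\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "html = \"<html><title lang='ko'>된장찌개</title><p>두부 <b>100g</b></p></html>\"\n",
    "\n",
    "title = re.findall(r'<title.*?>(.+?)</title>', html)  # text of one specific tag\n",
    "tokens = re.findall(r'<[^>]+>|[^<]+', html)           # tags and text runs, in order\n",
    "text_only = re.sub(r'<.*?>', '', html)                # every tag removed\n",
    "\n",
    "title, tokens, text_only\n",
    "```\n",
    "\n",
    "The second pattern returns tags and text interleaved, so the text runs still have to be filtered out of `tokens` before use."
   ]
  },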
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['레시피_식약처', '레시피Tag', '작업모음', '영양정보', '제철정보', '식재료가격', '공산품정보', '식단예제']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "((361, 5),        구분          메뉴명  대분류     중분류  원메뉴\n",
       " 358  국/탕류       황태국_해장  NaN  해물탕/찌개  NaN\n",
       " 359  국/탕류  후랑크소시지_김치찌개  NaN  김치국/찌개  NaN\n",
       " 360  국/탕류  후랑크소시지_두부찌개  NaN  두부국/찌개  NaN)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the list of sheets inside the Excel file\n",
    "file_name = r'data/momukji_lab_Tag작업.xlsx'\n",
    "import ast, json\n",
    "import pandas as pd\n",
    "# pd.ExcelFile selects a suitable engine; xlrd 2.0+ no longer opens .xlsx files\n",
    "xls = pd.ExcelFile(file_name)\n",
    "sht_names = xls.sheet_names; print(sht_names)\n",
    "df_orf = pd.read_excel(xls, sheet_name=sht_names[1])\n",
    "df = df_orf[df_orf.구분=='국/탕류']   # keep only the soup/stew rows\n",
    "df = df.reset_index(drop=True)\n",
    "df.shape, df.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Inspecting the Excel Sheet Contents**\n",
    "Organize the work done so far together with the collected data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>구분</th>\n",
       "      <th>메뉴명</th>\n",
       "      <th>대분류</th>\n",
       "      <th>중분류</th>\n",
       "      <th>원메뉴</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>국/찌개</td>\n",
       "      <td>LA김치두부찌개</td>\n",
       "      <td>찌개</td>\n",
       "      <td>두부국/찌개</td>\n",
       "      <td>두부찌개</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>국/찌개</td>\n",
       "      <td>LA두부찌개</td>\n",
       "      <td>찌개</td>\n",
       "      <td>두부국/찌개</td>\n",
       "      <td>두부찌개</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     구분       메뉴명 대분류     중분류   원메뉴\n",
       "0  국/찌개  LA김치두부찌개  찌개  두부국/찌개  두부찌개\n",
       "1  국/찌개    LA두부찌개  찌개  두부국/찌개  두부찌개"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_taged = pd.read_csv('muyong_tags.csv')\n",
    "df_taged.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_org = df_orf[df_orf[\"구분\"]!='국/탕류']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>구분</th>\n",
       "      <th>메뉴명</th>\n",
       "      <th>대분류</th>\n",
       "      <th>중분류</th>\n",
       "      <th>원메뉴</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>반찬</td>\n",
       "      <td>가지,애호박,오이,부추무침</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>반찬</td>\n",
       "      <td>고사리,무나물,호박볶음</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   구분             메뉴명  대분류  중분류  원메뉴\n",
       "0  반찬  가지,애호박,오이,부추무침  NaN  NaN  NaN\n",
       "1  반찬    고사리,무나물,호박볶음  NaN  NaN  NaN"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_org.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat([df_taged, df_org], axis=0).to_csv(\"momukji_taged_temp.csv\", encoding=\"cp949\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Exploring the MFDS (식약처) recipe data**\n",
    "Finding the recipes related to each menu name and linking them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the MFDS (식약처) recipe sheet\n",
    "import pandas as pd\n",
    "df_menu_mfds = pd.read_excel(file_name, sheet_name=sht_names[0])\n",
    "df_menu_mfds = df_menu_mfds.loc[:,['메뉴명','조리방법','요리종류','해쉬태그','재료정보']]\n",
    "# Split each ingredient text into a list of lines (missing values become [''])\n",
    "df_menu_mfds.재료정보 = [_.split('\\n')  for _ in df_menu_mfds.재료정보.fillna('')]\n",
    "df_menu_mfds.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **03 Exploring the 10,000 Recipes (만개의 레시피) data**\n",
    "Finding and organizing data that is somewhat more standardized.\n",
    "1. **[List String -> List Data](https://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list)**"
   ]
  },
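  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion referenced above (a string that looks like a list, turned back into a real list) can be sketched with the standard library alone. The record below and the helper name `parse_cell` are hypothetical; the actual `recipes` column may be shaped differently.\n",
    "\n",
    "```python\n",
    "import ast\n",
    "import json\n",
    "\n",
    "# Hypothetical record standing in for one cell of the recipes column\n",
    "record = [{'name': '두부', 'amount': '1모'}]\n",
    "python_style = str(record)       # Python repr text, uses single quotes\n",
    "json_style = json.dumps(record)  # JSON text, uses double quotes\n",
    "\n",
    "def parse_cell(text):\n",
    "    # Try JSON first, then fall back to a Python literal\n",
    "    try:\n",
    "        return json.loads(text)\n",
    "    except json.JSONDecodeError:\n",
    "        return ast.literal_eval(text)\n",
    "\n",
    "print(parse_cell(python_style))\n",
    "print(parse_cell(json_style))\n",
    "```\n",
    "\n",
    "`json.loads` rejects the single-quoted form, which is why the fallback to `ast.literal_eval` is there. `ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file."
   ]
  },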
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_menu_10000.recipes[100]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = ast.literal_eval(df_menu_10000.recipes[0])\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "json.loads(df_menu_10000.recipes[0])\n",
    "type(df_menu_10000.recipes[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect and tidy the detail information from the 10,000 recipes\n",
    "df_menu_10000.recipes = [json.loads(_) for _ in df_menu_10000.recipes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **03 Identifying separators and ingredient names, then reorganizing**\n",
    "Finding the recipes related to each menu name and linking them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the second-to-last sheet of the full workbook\n",
    "file_name = r'data/momukji_lab_full.xlsx'\n",
    "xls       = xlrd.open_workbook(file_name, on_demand=True)\n",
    "sht_names = xls.sheet_names()\n",
    "df_menu_zen = pd.read_excel(file_name, sheet_name=sht_names[-2])\n",
    "df_menu_zen.tail(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# df_menu_zen.iloc[:, 2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the selected sheet and the list of sheets inside the Excel file\n",
    "file_name = r'data/momukji_lab_Tag작업.xlsx'\n",
    "import xlrd\n",
    "xls = xlrd.open_workbook(file_name, on_demand=True)\n",
    "sht_names = xls.sheet_names()\n",
    "select_sheet = sht_names[1]\n",
    "select_sheet, \"/\".join(sht_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_menu_mfds.iloc[159,:].재료정보"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_menu_mfds.iloc[159,:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "# Distribution of the number of lines in each ingredient list\n",
    "Counter(len(_)  for _ in df_menu_mfds.재료정보)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First two ingredient lists that have exactly three lines\n",
    "[_  for _ in df_menu_mfds.재료정보  if len(_) == 3][:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_menu_mfds.재료정보[1].split('\\n')"
   ]
  },
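  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
    "# **2 Classifying soup (국/탕) menus**\n",
    "1. **국 / 탕 (soup)**: **'국' is the native Korean word and '탕' the Sino-Korean one**; '탕' is used as the honorific form of '국'.\n",
    "1. **찌개 (stew)**: a **boiled side dish** of meat, vegetables, or seafood, seasoned with soy sauce, doenjang, gochujang, or salted shrimp.\n",
    "1. **전골 (hot pot)**: a dish that is **boiled or stir-fried on a brazier set beside the table and eaten as it cooks**.\n",
    "\n",
    "## **01 Inspecting the Excel sheet contents**\n",
    "The work focuses on **data that is quick to process and already well structured**."
   ]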
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the selected sheet and the list of sheets inside the Excel file\n",
    "file_name = r'data/momukji_lab_Tag작업.xlsx'\n",
    "import xlrd\n",
    "xls = xlrd.open_workbook(file_name, on_demand=True)\n",
    "sht_names = xls.sheet_names()\n",
    "select_sheet = sht_names[1]\n",
    "select_sheet, \"/\".join(sht_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df_org = pd.read_excel(file_name, sheet_name=select_sheet)\n",
    "# Copy the filtered rows so that later column assignments do not target a slice of df_org\n",
    "df = df_org[df_org.구분=='국/탕류'].copy()\n",
    "# df = df.reset_index(drop=True)\n",
    "df.shape, df.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Working on the mid-level category (중분류)**\n",
    "Preprocess **with regex** and fill in the **sheet cells**.\n",
    "- Splitting the menus into two classes, **국 / 찌개**, is sufficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Label a menu 찌개 when its name contains that word, otherwise 국\n",
    "df = df.copy()  # independent copy, so the assignments below do not target a slice\n",
    "df[\"대분류\"] = [\"찌개\" if \"찌개\" in _ else \"국\"  for _ in df.메뉴명]\n",
    "df[\"구분\"]  = '국/찌개'\n",
    "# df.to_csv(\"muyong_tags.csv\", index=None)\n",
    "df.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
    "# **2 Filling in the base menu (원메뉴)**\n",
    "\n",
    "## **01 Filling in from the mid-level category**\n",
    "Preprocess **with regex** and fill in the **sheet cells**.\n",
    "- Work from the values of the **중분류** (mid-level category) column.\n",
    "- Link with the **MFDS recipes** to check and build the concrete data connections."
   ]
  },
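  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filling one column from the values of another is easiest to get right with a single `.loc[rows, column]` assignment. The sketch below uses a hypothetical three-row miniature of the tag sheet (`df_demo`); only the column names are taken from the sheet shown earlier.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical miniature of the tag sheet, for illustration only\n",
    "df_demo = pd.DataFrame({\n",
    "    '메뉴명': ['LA김치두부찌개', '황태국_해장', '후랑크소시지_김치찌개'],\n",
    "    '중분류': ['두부국/찌개', '해물탕/찌개', '김치국/찌개'],\n",
    "    '원메뉴': [None, None, None],\n",
    "})\n",
    "\n",
    "# Boolean mask plus column label in one .loc call writes into df_demo itself\n",
    "mask = df_demo['중분류'] == '두부국/찌개'\n",
    "df_demo.loc[mask, '원메뉴'] = '두부찌개'\n",
    "\n",
    "# A single row works the same way: row label first, column label second\n",
    "df_demo.loc[2, '원메뉴'] = '김치찌개'\n",
    "\n",
    "print(df_demo)\n",
    "```\n",
    "\n",
    "Chained forms such as `df.loc[2].원메뉴 = ...` or `df[mask].원메뉴 = ...` assign to a temporary object, so the original frame may be left unchanged."
   ]
  },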
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_mid_term = sorted(set(df.중분류))\n",
    "\",\".join(df_mid_term)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Checking and editing each mid-level category**\n",
    "Preprocess **with regex** and fill in the **sheet cells**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 0 Fill in the first mid-level category\n",
    "menu_name = df_mid_term[0]\n",
    "# Assign through .loc so the value is written to df itself\n",
    "df.loc[df.중분류 == menu_name, '원메뉴'] = menu_name\n",
    "df[df.중분류 == menu_name]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1 Edit 계란국\n",
    "menu_name = df_mid_term[1]\n",
    "df.loc[df.중분류 == menu_name, '원메뉴'] = menu_name\n",
    "# Row 98 is an exception and gets its own base menu\n",
    "df.loc[98, '원메뉴'] = \"부추국\"\n",
    "df[df.중분류 == menu_name]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2 Edit 근대국\n",
    "menu_name = df_mid_term[2]\n",
    "df.loc[df.중분류 == menu_name, '원메뉴'] = menu_name\n",
    "# df.loc[98, '원메뉴'] = \"부추국\"\n",
    "df[df.중분류 == menu_name]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[26].대분류"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3 김치국/찌개: choose the base menu from the 대분류 value\n",
    "menu_name = df_mid_term[3]\n",
    "mask = df.중분류 == menu_name\n",
    "df.loc[mask & (df.대분류 == '국'), '원메뉴'] = '김칫국'\n",
    "df.loc[mask & (df.대분류 != '국'), '원메뉴'] = '김치찌개'\n",
    "df[mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.원메뉴[df.중분류 == menu_name]# = menu_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# df.loc[98].원메뉴 = \"부추국\"\n",
    "df.to_csv(\"muyong_tags.csv\", index=None)\n",
    "df[df.중분류 == menu_name]"
   ]
  },
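  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
    "# **3 Filling in the base menu (원메뉴)**\n",
    "## **01 Filling in from the mid-level category**\n",
    "Preprocess **with regex** and fill in the **sheet cells**.\n",
    "- Work from the values of the **중분류** (mid-level category) column.\n",
    "- Link with the **MFDS recipes** to check and build the concrete data connections."
   ]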
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign through .loc so the value is written to df rather than to a temporary copy\n",
    "df.loc[df.중분류 == df_mid_term[0], '원메뉴'] = '감자탕'\n",
    "# = df_mid_term[0]\n",
    "# df.loc[list(df[df.중분류 == df_mid_term[0]].index), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.중분류 == df_mid_term[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.loc[:, [\"메뉴명\",\"구분\",\"중분류\",\"원메뉴\",\"분류명\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **02 Filling in the base menu (원메뉴)**\n",
    "Preprocess **with regex** and fill in the **sheet cells**.\n",
    "- Splitting the menus into two classes, **국 / 찌개**, is sufficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv(\"muyong_tags.csv\")\n",
    "df.tail(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(set(df.원메뉴))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.원메뉴.isna()].index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of menu names that contain 찌개\n",
    "sum(\"찌개\" in _  for _ in df.메뉴명)"
   ]
  },
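  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Crawling menu names from 10,000 Recipes (만개의 레시피)**\n",
    "## **1 A look at the collection functions**\n",
    "The work focuses on **data that is quick to process and already well structured**.\n",
    "```python\n",
    "'<title.*?>(.+?)</title>'  # capture the contents of a specific tag\n",
    "\"<[^>]+>|[^<]+\"            # split into tags and text runs (used to pull the Korean text out of HTML)\n",
    "'<.*?>'                    # match any tag\n",
    "\n",
    "# List the menus, then fetch the detail of one of them\n",
    "from momukji import recipeMan\n",
    "page_urls = recipeMan().menu_list(11)\n",
    "page_urls[1][1], recipeMan().menu_detail(page_urls[1][0])\n",
    "```"
   ]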
  },
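  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
    "# **Extracting and analyzing identifying tokens**\n",
    "## **1 Filtering the MFDS menu information**\n",
    "The work focuses on **data that is quick to process and already well structured**."
   ]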
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "food_menu_org = pd.read_csv(\"data/food_recipe.csv\")\n",
    "food_menu_org.shape, food_menu_org.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from momukji import Nlp\n",
    "nlp = Nlp('data/nouns_tokens.txt')\n",
    "# Pair each menu name with the tokens from nlp.food_nouns, joined by '_'\n",
    "menu_valid = [[\"_\".join(nlp.food_nouns(_)), _]  for _ in food_menu_org.메뉴명]\n",
    "# Insert the joined tokens as a new column at position 2\n",
    "food_menu_org.insert(2, 'menu_token',[_[0]  for _ in menu_valid])\n",
    "print(food_menu_org.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "food_menu_org.to_csv(\"data/food_recipe_addToken.csv\", index=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **2 Menu information from 10,000 Recipes (만개의 레시피)**\n",
    "The work focuses on **data that is quick to process and already well structured**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the saved contents\n",
    "import pandas as pd\n",
    "df_menu = pd.read_csv(\"data/food_recipe_10000.csv\", sep=';', header=None)\n",
    "df_menu.columns = ['url','menu','tags','recipe']\n",
    "menu_valid = [[\"_\".join(nlp.food_nouns(_)), _]  for _ in df_menu.menu]\n",
    "df_menu.head(1), menu_valid[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "\n",
    "# **Filtering the 10,000 Recipes (만개의 레시피) menu information**\n",
    "The work focuses on **data that is quick to process and already well structured**."
   ]
  },
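  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to sketch the title cleanup is a small helper that removes digit runs followed by 회 or 분 in one pass. The helper name `clean_menu`, the sample titles, and the choice to map missing titles to an empty string are all assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Matches a run of digits followed by 회 or 분\n",
    "COUNT_OR_MINUTES = re.compile(r'\\d+[회분]')\n",
    "\n",
    "def clean_menu(title):\n",
    "    # Missing titles (None or NaN) become an empty string\n",
    "    if not isinstance(title, str):\n",
    "        return ''\n",
    "    return COUNT_OR_MINUTES.sub('', title).strip()\n",
    "\n",
    "# Hypothetical titles for illustration\n",
    "samples = ['김치찌개 30분', '3회 두부조림', None]\n",
    "print([clean_menu(s) for s in samples])\n",
    "```\n",
    "\n",
    "Compiling the pattern once avoids re-parsing it for every title, and the type check keeps a missing value from raising a `TypeError` inside `re.sub`."
   ]
  },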
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "# Remove '<digits>회' and '<digits>분' fragments from the menu titles (raw strings keep \\d intact)\n",
    "df_menu.menu = [re.sub(r\"\\d+회\", \"\", _).strip()   for _ in df_menu.menu]\n",
    "df_menu.menu = [re.sub(r\"\\d+분\", \"\", _).strip()   for _ in df_menu.menu]\n",
    "\",\".join(list(df_menu.menu)[:20])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "menu_valid = [[\"_\".join(nlp.food_nouns(_)), _]  for _ in df_menu.menu.fillna(\"\")]\n",
    "menu_valid[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"///\".join([_[0]+\":\"+_[1]  for _ in menu_valid])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}