{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Url Encoding / Parsing**\n", "출처: https://dololak.tistory.com/255 [코끼리를 냉장고에 넣는 방법]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **urllib 모듈을 사용한 경로명 작업**\n", "\n", "## **1 Url Encoding**\n", "Url Query 생성 및 경로 연결" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Base url 추출 및 추가경로 붙이기\n", "from urllib import parse\n", "\n", "url_root = \"https://www.kamis.or.kr/test/join\"\n", "url_base = parse.urljoin(url_root, '/customer/price/retail/item.do?action=priceinfo')\n", "url_base" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'examParam1=value1&examParam2=aaa&examParam2=bbb'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = {\n", " 'examParam1': 'value1',\n", " 'examParam2': ['aaa', 'bbb']\n", "} \n", "parse.urlencode(query, encoding='UTF-8', doseq=True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'regday=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Param 를 사용하여 Query Url 만들기\n", "query = {\n", " \"regday\": \"2019-10-18\",\n", " \"itemcode\": \"111\",\n", " \"convert_kg_yn\": \"N\",\n", " \"itemcategorycode\": \"100\",\n", "}\n", "url_query = parse.urlencode(query, encoding='UTF-8', doseq=True)\n", "url_query" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Base url 과 Params 연결하기\n", "url = url_base + \"&\" + url_query\n", "url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Url Parsing**\n", "필요한 부분 추출하기" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ParseResult(scheme='https', netloc='www.kamis.or.kr', path='/customer/price/retail/item.do', params='', query='action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100', fragment='')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 긴 Url 경로를 Parsing 하기\n", "url_parse = parse.urlparse(url)\n", "url_parse" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'www.kamis.or.kr'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Root Url 추출하기 (url Net Location)\n", "url_parse.netloc" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/customer/price/retail/item.do'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# url Add Paths\n", "url_parse.path" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# url Query\n", "url_parse.query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **Urllib 의 Request 크롤링**\n", "**[한국 IT협회 교육내용 정리](https://leechamin.tistory.com/category/IT%20%ED%99%9C%EB%8F%99/%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5%EA%B5%90%EC%9C%A1%20-%20NLP?page=5)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 Request 를 사용한 크롤링**\n", "Urllib 모듈의 **request** 사용" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'User-agent: Yeti\\nAllow: /main/imagemontage\\nDisallow: /\\nUser-agent: *\\nDisallow: /'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Python 기본 모듈인 urllib 을 사용하여 Robots.txt 내용 확인하기\n", "from urllib import request\n", "\n", "url = \"https://news.naver.com/robots.txt\"\n", "resp = request.urlopen(url)\n", "resp.read() # 통으로 불러오기" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[b'\\t\\t\\t\\t\\t\\t\\t\\t\\t\\xeb\\xa1\\x9c\\xea\\xb7\\xb8\\xec\\x9d\\xb8\\r\\n',\n", " b'\\t\\t\\t\\t\\t\\t\\t\\t\\r\\n',\n", " b'\\t\\t\\t\\t\\t\\t\\t\\t
  • \\r\\n',\n", " b'\\t\\t\\t\\t\\t\\t\\t\\t\\t\\xed\\x9a\\x8c\\xec\\x9b\\x90\\xea\\xb0\\x80\\xec\\x9e\\x85\\r\\n']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 필요한 상세페이지 크롤링\n", "url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100\"\n", "resp = request.urlopen(url)\n", "r_test = resp.readlines()[103:107]\n", "r_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 한글 인코딩**\n", "1. **Byte Text** 를 **str** 로 변환하고\n", "1. **인코딩** 으로 **한글을 복구** 합니다." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(str,\n", " '%09%09%09%09%09%09%09%09%09%3Ca+href%3D%22%2Fcustomer%2Fcustomer_service%2Fjoin%2Fcustomer_join.do%22%3E%ED%9A%8C%EC%9B%90%EA%B0%80%EC%9E%85%3C%2Fa%3E%0D%0A')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from urllib import parse\n", "txt_byte = parse.quote_plus(r_test[3])\n", "type(txt_byte), txt_byte" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\t\\t\\t\\t\\t\\t\\t\\t\\t회원가입\\r\\n'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 변환된 str 중 인코딩 문제를 해결\n", "parse.unquote_plus(txt_byte)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **3 한글의 초성, 중성, 조성 분리**\n", "**[한글의 자음/모음 분해하기](https://frhyme.github.io/python/python_korean_englished/)**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "\"\"\"Generate Strings from `c1` to `c2`, inclusive.\"\"\"\n", "def str_range(c1, c2):\n", " for c in range(ord(c1), ord(c2)+1):\n", " yield chr(c)\n", "\n", "str_start = [_ for _ in [c for c in str_range('ㄱ', 'ㅎ')] # 초성 리스트\n", " if _ not in ['ㄳ','ㄵ','ㄶ','ㄺ','ㄻ','ㄼ','ㄽ','ㄾ','ㄿ','ㅀ','ㅄ']]\n", "str_mid = [c for c in str_range('ㅏ', 'ㅣ')] # 중성 리스트\n", "str_last = [_ for _ in [\" \"] + [c for c in str_range('ㄱ', 'ㅎ')] \n", " if _ not in ['ㄸ', 'ㅃ', 'ㅉ']] # 종성 리스트" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ㄷ', 'ㅚ', 'ㄴ'],\n", " ['ㅈ', 'ㅏ', 'ㅇ'],\n", " ['ㅉ', 'ㅣ', ' '],\n", " ['ㄱ', 'ㅐ', ' '],\n", " [' '],\n", " ['ㅂ', 'ㅗ', 'ㄲ'],\n", " ['ㅇ', 'ㅡ', 'ㅁ'],\n", " ['a'],\n", " ['b'],\n", " ['c']]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def str_korchar(str_text):\n", " r_lst = []\n", " for _ in list(str_text.strip()):\n", " if '가' <= _ <= '힣': # 영어는 구분작성 \n", " ch1 = (ord(_)-ord('가'))//588 # 588개 마다 초성변경 \n", " ch2 = ((ord(_)-ord('가'))-(588*ch1))//28 # 중성은 총 28개\n", " ch3 = (ord(_) - ord('가'))-(588*ch1)-28*ch2\n", " r_lst.append([str_start[ch1], str_mid[ch2], str_last[ch3]])\n", " else: r_lst.append([_])\n", " return r_lst\n", "\n", "str_korchar(\"된장찌개 볶음abc\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['깍', '두', '기']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\"깍두기\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ㄲ', 'ㅏ', 'ㄱ'], ['ㄷ', 'ㅜ', ' '], ['ㄱ', 'ㅣ', ' ']]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_korchar(\"깍두기\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ord(\"깍\")-ord(\"가\"))//588" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
    \n", "\n", "# **Requests 크롤링**\n", "**[Pypi](https://pypi.org/project/requests/) :** ! pip install requests " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 Urllib 의 Parse 를 활용한 경로추출**\n", "앞에서 작업했던 내용을 활용하여 경로명 만들기" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SplitResult(scheme='https', netloc='www.kamis.or.kr', path='/customer/price/retail/item.do', query='action=priceinfo', fragment='')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Base url 경로\n", "url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n", "\n", "from urllib import parse\n", "parse.urlsplit(url)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'®day=2019-10-18&itemcategorycode=100&itemcode=111&convert_kg_yn=N'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "params = {\n", " \"regday\" : \"2019-10-18\",\n", " \"itemcategorycode\" : \"100\",\n", " \"itemcode\" : \"111\",\n", " \"convert_kg_yn\" : \"N\",\n", "}\n", "query = '&' + parse.urlencode(params, encoding='UTF-8')\n", "query" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcategorycode=100&itemcode=111&convert_kg_yn=N'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url + query" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    구분구분.1구분.2당일 10/181일전10/172일전10/163일전10/154일전10/147일전10/111개월전1년전평년
    0평균평균평균514695147551484516955169551318515535325345758
    1최고값최고값최고값580005800058000580005800056000580006250059600
    2최저값최저값최저값449004490044900459004590045900478004900033800
    \n", "
    " ], "text/plain": [ " 구분 구분.1 구분.2 당일 10/18 1일전10/17 2일전10/16 3일전10/15 4일전10/14 7일전10/11 1개월전 \\\n", "0 평균 평균 평균 51469 51475 51484 51695 51695 51318 51553 \n", "1 최고값 최고값 최고값 58000 58000 58000 58000 58000 56000 58000 \n", "2 최저값 최저값 최저값 44900 44900 44900 45900 45900 45900 47800 \n", "\n", " 1년전 평년 \n", "0 53253 45758 \n", "1 62500 59600 \n", "2 49000 33800 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.read_html(url + query)[3].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Requests 를 사용한 Get 크롤링**\n", "1. Url 의 **pararm** 경로와, 가상 **web Header** 인코딩을 자동처리 합니다.\n", "1. https://pypi.org/project/fake-useragent/" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);\n", "# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)\n", "# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n", "# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13\n", "# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1\n", "# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1\n", "# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11\n", "# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\t\\t\\t
  • 농수산식품기업지원센터
  • \\r\\n\\t\\t\\t\\t\\t
  • aT메인
  • \\r\\n\\t\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t
    \\r\\n\\t\\t\\t\\t<'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n", "params = {\n", " \"regday\" : \"2019-10-18\",\n", " \"itemcategorycode\" : \"100\",\n", " \"itemcode\" : \"111\",\n", " \"convert_kg_yn\" : \"N\",\n", "}\n", "\n", "import requests\n", "web = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n", "headers = {'User-Agent' : web}\n", "resp = requests.get(url, params=params, headers=headers)\n", "resp.text[2800:3000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Requests 를 사용한 Post 크롤링**\n", "1. Url 의 **pararm** 경로와, 가상 **web Header** 인코딩을 자동처리 합니다.\n", "1. https://pypi.org/project/fake-useragent/" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'kR4BVNFag+szu8hB3hhNym4wnhHiECpalBD0QD5gSoY4/Mjy4rk9q1LU0QEBvNx3VL/nX4om315BLTjIX61DApKASSKfgGAp4Jqixd5FS6sKsv5LGpssDJ7Zl00V37JevrpGopSxpJATyEj4YPPUfOCnedA+9U5lnd06bVaLHpljzFThxW7yJKPGJ2/gVaY36o75kXbijHhx476kbDn1h7fJpvPTvq+bUW+PV6jAxw11fjqegl6QlsJCpllml/kOmriWaPY/1wrCgpqPf1MfAO0M5UAH9uwubXKtNjgkp5w='" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OTP key 값 추출하기\n", "# KRX 주가 데이터 수집예제\n", "gen_otp_url = 'http://marketdata.krx.co.kr/contents/COM/GenerateOTP.jspx'\n", "gen_otp_data = {\n", " 'name':'fileDown', \n", " 'filetype':'xls',\n", " 'market_gubun':'ALL', #시장구분: ALL=전체\n", " 'url':'MKD/04/0404/04040200/mkd04040200_01',\n", " 'schdate':'20191018',\n", " 'pagePath':'/contents/MKD/04/0404/04040200/MKD04040200.jsp'\n", "}\n", "r = requests.post(gen_otp_url, gen_otp_data)\n", "OTP_code = r.content\n", "OTP_code" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\t\\xea\\xb0\\x80\\xea\\xb2\\xa9\\xec\\xa0\\x95\\xeb\\xb3\\xb4 - \\xec\\xb9\\x9c\\xed\\x99\\x98\\xea\\xb2\\xbd\\xeb\\x86\\x8d\\xec\\x82\\xb0\\xeb\\xac\\xbc - \\xed\\x92\\x88\\xeb\\xaa\\xa9\\xeb\\xb3\\x84 \\xec\\x97\\x91\\xec\\x85\\x80 \\xec\\xb6\\x9c\\xeb\\xa0\\xa5\\r\\n\\t\\r\\n\\t\\r\\n\\r\\n\\r\\n\\t\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\t
    \\r\\n\\t\\t\\t\\t\\t\\r\\n\\t\\t\\t\\t
    \\r\\n\\t\\t\\t\\r\\n\\t\\t\\r\\n\\t\\r\\n\\r\\n'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {\n", " \"Accept\" :\"text/html,application/xhtml+xm…plication/xml;q=0.9,*/*;q=0.8\",\n", " \"Accept-Encoding\" : \"gzip, deflate, br\",\n", " \"Accept-Language\" : \"ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3\",\n", " \"Connection\" : \"keep-alive\",\n", " \"Content-Length\" : \"76940\",\n", " \"Content-Type\" : \"application/x-www-form-urlencoded\",\n", " \"Host\" : \"www.kamis.or.kr\",\n", " \"Referer\" : url + query,\n", " \"Upgrade-Insecure-Requests\" : \"1\",\n", " \"User-Agent\" : \"Mozilla/5.0 (X11; Ubuntu; Linu…) Gecko/20100101 Firefox/69.0\"\n", "}\n", "resp = requests.post(\"https://www.kamis.or.kr/jsp/common/excel_util.jsp\", data=data)\n", "resp.content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **3 KAMIS 데이터 크롤링**\n", "Get 방식으로 기존의 lxml 모듈로 수집하기" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n", "params = {\n", " \"regday\" : \"2015-10-14\",\n", " \"itemcategorycode\" : \"100\",\n", " \"itemcode\" : \"111\",\n", " \"convert_kg_yn\" : \"N\",\n", "}\n", "\n", "import requests\n", "web = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n", "headers = {'User-Agent' : web}\n", "resp = requests.get(url, params=params, headers=headers)\n", "resp" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['45042', '50000', '38800']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from lxml.html import fromstring, tostring\n", "resp_lxml = fromstring(resp.text)\n", "resp_table = resp_lxml.xpath('//table[@id=\"itemTable_1\"]')[0]\n", "\n", "import pandas as pd\n", "table = pd.read_html(tostring(resp_table))[0]\n", "table_idx = [_ for _, txt in enumerate(list(table.columns)) if txt.find('당일') != -1]\n", "\n", "if len(table_idx) == 1:\n", " table_idx = table_idx[0]\n", " table = table.iloc[:,[0, table_idx]].set_index(\n", " table.columns[0]).loc[[\"평균\",\"최고값\",\"최저값\"]]\n", " table = list(table.to_dict(orient='list').values())[0]\n", "else:\n", " table = None\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
    \n", "\n", "# **API Key 를 활용한 크롤링**\n", "\n", "## **1 식품안전나라 안내 Page**\n", "1. **[Open API 메인 페이지](https://www.foodsafetykorea.go.kr/api/userApiKey.do#)**, **[API 활용 방법](https://www.foodsafetykorea.go.kr/api/howToUseApi.do?menu_grp=MENU_GRP34&menu_no=687)**, **[API 이용절차](https://www.foodsafetykorea.go.kr/api/board/boardDetail.do)**\n", "\n", "상세페이지 바로가기" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 조리 레시피 API 안내 페이지 https://www.foodsafetykorea.go.kr/api/openApiInfo.do\n", "query = {\n", " \"menu_no\":\"849\",\n", " \"menu_grp\":\"MENU_GRP31\",\n", " \"start_idx\":\"120\",\n", " \"svc_no\":\"COOKRCP01\",\n", " \"svc_type_cd\":\"API_TYPE06\",\n", " \"show_cnt\":\"10\",\n", " \"Referer\":\"https://www.foodsafetykorea.go.kr/api/sheetInfo.do\",\n", "}\n", "import requests\n", "resp = requests.post(\"https://www.foodsafetykorea.go.kr/api/openApiInfo.do\", data=query)\n", "resp\n", "# with open(\"food.html\", \"w\") as f:\n", "# f.write(resp.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 API 크롤링**\n", "1. **[Open API 메인 페이지](https://www.foodsafetykorea.go.kr/api/userApiKey.do#)**, **[API 활용 방법](https://www.foodsafetykorea.go.kr/api/howToUseApi.do?menu_grp=MENU_GRP34&menu_no=687)**, **[API 이용절차](https://www.foodsafetykorea.go.kr/api/board/boardDetail.do)**\n", "1. key : \"8acba1823ae742359560\"\n", "\n", "**\"keyId\"** 부분에 **인증키를** 입력하고, **\"serviceId\"에는** 미리보기 주소에 있는 **serviceId** 를 사용\n", "\n", "ex) http://openapi.foodsafetykorea.go.kr/api/인증키/I0580/xml/1/20 \n", "\n", "1. HACCP 적용업소 지정현황 **'I0580'** 사용\n", "1. **'dataType'** 에서 **Xml, Json** 중 타입을 한가지 입력\n", "1. **'startIdx'** 에서 **시작 정보 행 순번** (1번부터 : '1')\n", "1. **'endIdx'** 에서 **마지막 행 순번** (20번까지 : '20')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# http://openapi.foodsafetykorea.go.kr/api/keyId/serviceId/dataType/startIdx/endIdx\n", "\n", "url = \"http://openapi.foodsafetykorea.go.kr/api\" # sample/I2620/xml/1/5\"\n", "keyId = \"8acba1823ae742359560\"\n", "serviceId = \"I0750\" # \"I0580\" # I0750\n", "dataType = \"json\"\n", "startNum = \"1\"\n", "endNum = \"3\"\n", "url = \"/\".join([url, keyId, serviceId, dataType, startNum, endNum])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.15 ms, sys: 402 µs, total: 7.55 ms\n", "Wall time: 22.7 s\n" ] }, { "data": { "text/plain": [ "{'I0750': {'RESULT': {'MSG': '정상처리되었습니다.', 'CODE': 'INFO-000'},\n", " 'total_count': '13824',\n", " 'row': [{'NUTR_CONT3': '10.1',\n", " 'NUTR_CONT2': '67.8',\n", " 'NUTR_CONT1': '349',\n", " 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n", " 'BGN_YEAR': '2001',\n", " 'NUTR_CONT9': 'N/A',\n", " 'NUTR_CONT8': 'N/A',\n", " 'FOOD_CD': '100101000100000001',\n", " 'NUTR_CONT7': 'N/A',\n", " 'NUTR_CONT6': 'N/A',\n", " 'NUTR_CONT5': 'N/A',\n", " 'NUTR_CONT4': '3.7',\n", " 'DESC_KOR': '고량미,알곡',\n", " 'SERVING_WT': '100',\n", " 'FDGRP_NM': '곡류 및 그 제품',\n", " 'NUM': '1',\n", " 'ANIMAL_PLANT': ''},\n", " {'NUTR_CONT3': '11.4',\n", " 'NUTR_CONT2': '73.5',\n", " 'NUTR_CONT1': '332',\n", " 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n", " 'BGN_YEAR': '2017',\n", " 'NUTR_CONT9': 'N/A',\n", " 'NUTR_CONT8': 'N/A',\n", " 'FOOD_CD': '100101000200200001',\n", " 'NUTR_CONT7': 'N/A',\n", " 'NUTR_CONT6': '2',\n", " 'NUTR_CONT5': 'N/A',\n", " 'NUTR_CONT4': '3.7',\n", " 'DESC_KOR': '겉귀리,생것',\n", " 'SERVING_WT': '100',\n", " 'FDGRP_NM': '곡류 및 그 제품',\n", " 'NUM': '2',\n", " 'ANIMAL_PLANT': ''},\n", " {'NUTR_CONT3': '14.3',\n", " 'NUTR_CONT2': '70.4',\n", " 'NUTR_CONT1': '334',\n", " 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n", " 'BGN_YEAR': '2017',\n", " 'NUTR_CONT9': 'N/A',\n", " 'NUTR_CONT8': 'N/A',\n", " 'FOOD_CD': '100101000200300001',\n", " 'NUTR_CONT7': 'N/A',\n", " 'NUTR_CONT6': '3',\n", " 'NUTR_CONT5': 'N/A',\n", " 'NUTR_CONT4': '3.8',\n", " 'DESC_KOR': '쌀귀리,생것',\n", " 'SERVING_WT': '100',\n", " 'FDGRP_NM': '곡류 및 그 제품',\n", " 'NUM': '3',\n", " 'ANIMAL_PLANT': ''}]}}" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "from urllib import request, parse\n", "resp = request.urlopen(url)\n", "resp = parse.quote_plus(resp.read()) # Byte to String\n", "resp = parse.unquote_plus(resp) # Encoding Text\n", "\n", "import json\n", "resp = json.loads(resp) # Json 데이터를 Dict 으로 변환\n", "resp\n", "# len(resp['I0750']['row'])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'info-000'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resp['I0750']['RESULT']['CODE'].lower()#.find(\"error\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resp['I0750']['RESULT']['CODE'].lower().find(\"error\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "col_to_kor = {\n", " \"NUM\":\"번호\",\n", " \"FOOD_CD\":\"식품코드\",\n", " \"FDGRP_NM\":\"식품군\",\n", " \"DESC_KOR\":\"식품이름\",\n", " \"SERVING_WT\":\"1회제공량(g)\",\n", " \"NUTR_CONT1\":\"열량(kcal)(1회제공량당)\",\n", " \"NUTR_CONT2\":\"탄수화물(g)(1회제공량당)\",\n", " \"NUTR_CONT3\":\"단백질(g)(1회제공량당)\",\n", " \"NUTR_CONT4\":\"지방(g)(1회제공량당)\",\n", " \"NUTR_CONT5\":\"당류(g)(1회제공량당)\",\n", " \"NUTR_CONT6\":\"나트륨(mg)(1회제공량당)\",\n", " \"NUTR_CONT7\":\"콜레스테롤(mg)(1회제공량당)\",\n", " \"NUTR_CONT8\":\"포화지방산(g)(1회제공량당)\",\n", " \"NUTR_CONT9\":\"트랜스지방(g)(1회제공량당)\",\n", " \"ANIMAL_PLANT\":\"가공업체명\",\n", " \"BGN_YEAR\":\"구축년도\",\n", " \"FOOD_GROUP\":\"자료원\",\n", "}" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    번호식품코드식품군식품이름1회제공량(g)열량(kcal)(1회제공량당)탄수화물(g)(1회제공량당)단백질(g)(1회제공량당)지방(g)(1회제공량당)당류(g)(1회제공량당)나트륨(mg)(1회제공량당)콜레스테롤(mg)(1회제공량당)포화지방산(g)(1회제공량당)트랜스지방(g)(1회제공량당)가공업체명구축년도자료원
    01100101000100000001곡류 및 그 제품고량미,알곡10034967.810.13.7N/AN/AN/AN/AN/A2001농촌진흥청 식품성분표 DB
    12100101000200200001곡류 및 그 제품겉귀리,생것10033273.511.43.7N/A2N/AN/AN/A2017농촌진흥청 식품성분표 DB
    23100101000200300001곡류 및 그 제품쌀귀리,생것10033470.414.33.8N/A3N/AN/AN/A2017농촌진흥청 식품성분표 DB
    \n", "
    " ], "text/plain": [ " 번호 식품코드 식품군 식품이름 1회제공량(g) 열량(kcal)(1회제공량당) \\\n", "0 1 100101000100000001 곡류 및 그 제품 고량미,알곡 100 349 \n", "1 2 100101000200200001 곡류 및 그 제품 겉귀리,생것 100 332 \n", "2 3 100101000200300001 곡류 및 그 제품 쌀귀리,생것 100 334 \n", "\n", " 탄수화물(g)(1회제공량당) 단백질(g)(1회제공량당) 지방(g)(1회제공량당) 당류(g)(1회제공량당) 나트륨(mg)(1회제공량당) \\\n", "0 67.8 10.1 3.7 N/A N/A \n", "1 73.5 11.4 3.7 N/A 2 \n", "2 70.4 14.3 3.8 N/A 3 \n", "\n", " 콜레스테롤(mg)(1회제공량당) 포화지방산(g)(1회제공량당) 트랜스지방(g)(1회제공량당) 가공업체명 구축년도 \\\n", "0 N/A N/A N/A 2001 \n", "1 N/A N/A N/A 2017 \n", "2 N/A N/A N/A 2017 \n", "\n", " 자료원 \n", "0 농촌진흥청 식품성분표 DB \n", "1 농촌진흥청 식품성분표 DB \n", "2 농촌진흥청 식품성분표 DB " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame(resp['I0750']['row'])\n", "df = df.loc[:,list(col_to_kor.keys())] # header 정렬\n", "df.columns = [col_to_kor[_] for _ in list(df.columns)] \n", "# df.to_excel('data/food_nutrient.xls', index=False)\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }