{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Url Encoding / Parsing**\n",
"출처: https://dololak.tistory.com/255 [코끼리를 냉장고에 넣는 방법]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **urllib 모듈을 사용한 경로명 작업**\n",
"\n",
"## **1 Url Encoding**\n",
"**한글 쿼리문의 인코딩** 해결 및 **Url Query** 생성 및 경로 연결\n",
"1. parse.**quote_plus** (\"%B8%D3%BD%C5%B7%AF%B4%D7\") \n",
"1. parse.**unquote_plus** (\"파이썬\", encoding='euc-kr')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'머신러닝'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 한글 인코딩/ 디코딩 문법\n",
"from urllib import parse\n",
"temp = \"%B8%D3%BD%C5%B7%AF%B4%D7\"\n",
"parse.unquote_plus(temp, encoding='euc-kr')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'%B8%D3%BD%C5%B7%AF%B4%D7'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parse.quote_plus(\"머신러닝\", encoding='euc-kr')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Base url 추출 및 추가경로 붙이기\n",
"from urllib import parse\n",
"url_root = \"https://www.kamis.or.kr/test/join\"\n",
"url_base = parse.urljoin(url_root, '/customer/price/retail/item.do?action=priceinfo')\n",
"url_base"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'examParam1=value1&examParam2=aaa&examParam2=bbb&Korquery=%ED%8C%8C%EC%9D%B4'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = {\n",
" 'examParam1' : 'value1',\n",
" 'examParam2' : ['aaa', 'bbb'],\n",
" 'Korquery' : \"파이\"\n",
"} \n",
"parse.urlencode(query, encoding='UTF-8', doseq=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'regday=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Param 를 사용하여 Query Url 만들기\n",
"query = {\n",
" \"regday\":\"2019-10-18\",\n",
" \"itemcode\":\"111\",\n",
" \"convert_kg_yn\":\"N\",\n",
" \"itemcategorycode\":\"100\",\n",
"}\n",
"url_query = parse.urlencode(query, encoding='UTF-8', doseq=True)\n",
"url_query"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Base url 과 Params 연결하기\n",
"url = url_base + \"&\" + url_query\n",
"url"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 Url Parsing**\n",
"필요한 부분 추출하기"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ParseResult(scheme='https', netloc='www.kamis.or.kr', path='/customer/price/retail/item.do', params='', query='action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100', fragment='')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 긴 Url 경로를 Parsing 하기\n",
"url_parse = parse.urlparse(url)\n",
"url_parse"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'www.kamis.or.kr'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Root Url 추출하기 (url Net Location)\n",
"url_parse.netloc"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'/customer/price/retail/item.do'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# url Add Paths\n",
"url_parse.path"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# url Query\n",
"url_parse.query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"# **Urllib 의 Request 크롤링**\n",
"**[한국 IT협회 교육내용 정리](https://leechamin.tistory.com/category/IT%20%ED%99%9C%EB%8F%99/%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5%EA%B5%90%EC%9C%A1%20-%20NLP?page=5)**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **1 Request 를 사용한 크롤링**\n",
"Urllib 모듈의 **request** 사용"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'User-agent: Yeti\\nAllow: /main/imagemontage\\nDisallow: /\\nUser-agent: *\\nDisallow: /'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Python 기본 모듈인 urllib 을 사용하여 Robots.txt 내용 확인하기\n",
"from urllib import request\n",
"url = \"https://news.naver.com/robots.txt\"\n",
"resp = request.urlopen(url)\n",
"resp.read() # 통으로 불러오기"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[b'\\t\\t\\t\\t\\t\\t\\t\\t\\t\\xeb\\xa1\\x9c\\xea\\xb7\\xb8\\xec\\x9d\\xb8\\r\\n',\n",
" b'\\t\\t\\t\\t\\t\\t\\t\\t\\r\\n',\n",
" b'\\t\\t\\t\\t\\t\\t\\t\\t
\\r\\n',\n",
" b'\\t\\t\\t\\t\\t\\t\\t\\t\\t\\xed\\x9a\\x8c\\xec\\x9b\\x90\\xea\\xb0\\x80\\xec\\x9e\\x85\\r\\n']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 상세 페이지 크롤링\n",
"url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcode=111&convert_kg_yn=N&itemcategorycode=100\"\n",
"resp = request.urlopen(url)\n",
"r_test = resp.readlines()[103:107]\n",
"r_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 한글 인코딩**\n",
"1. **Byte Text** 를 **str** 로 변환하고\n",
"1. **인코딩** 으로 **한글을 복구** 합니다."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(str,\n",
" '%09%09%09%09%09%09%09%09%09%3Ca+href%3D%22%2Fcustomer%2Fcustomer_service%2Fjoin%2Fcustomer_join.do%22%3E%ED%9A%8C%EC%9B%90%EA%B0%80%EC%9E%85%3C%2Fa%3E%0D%0A')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from urllib import parse\n",
"txt_byte = parse.quote_plus(r_test[3])\n",
"type(txt_byte), txt_byte"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\t\\t\\t\\t\\t\\t\\t\\t\\t회원가입\\r\\n'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 변환된 str 중 인코딩 문제를 해결\n",
"parse.unquote_plus(txt_byte)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **3 한글의 초성, 중성, 조성 분리**\n",
"**[한글의 자음/모음 분해하기](https://frhyme.github.io/python/python_korean_englished/)**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Generate Strings from `c1` to `c2`, inclusive.\"\"\"\n",
"def str_range(c1, c2):\n",
" for c in range(ord(c1), ord(c2)+1):\n",
" yield chr(c)\n",
"\n",
"str_start = [_ for _ in [c for c in str_range('ㄱ', 'ㅎ')] # 초성 리스트\n",
" if _ not in ['ㄳ','ㄵ','ㄶ','ㄺ','ㄻ','ㄼ','ㄽ','ㄾ','ㄿ','ㅀ','ㅄ']]\n",
"str_mid = [c for c in str_range('ㅏ', 'ㅣ')] # 중성 리스트\n",
"str_last = [_ for _ in [\" \"] + [c for c in str_range('ㄱ', 'ㅎ')] \n",
" if _ not in ['ㄸ', 'ㅃ', 'ㅉ']] # 종성 리스트"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['ㄷ', 'ㅚ', 'ㄴ'],\n",
" ['ㅈ', 'ㅏ', 'ㅇ'],\n",
" ['ㅉ', 'ㅣ', ' '],\n",
" ['ㄱ', 'ㅐ', ' '],\n",
" [' '],\n",
" ['ㅂ', 'ㅗ', 'ㄲ'],\n",
" ['ㅇ', 'ㅡ', 'ㅁ'],\n",
" ['a'],\n",
" ['b'],\n",
" ['c']]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def str_korchar(str_text):\n",
" r_lst = []\n",
" for _ in list(str_text.strip()):\n",
" if '가' <= _ <= '힣': # 영어는 구분작성 \n",
" ch1 = (ord(_)-ord('가'))//588 # 588개 마다 초성변경 \n",
" ch2 = ((ord(_)-ord('가'))-(588*ch1))//28 # 중성은 총 28개\n",
" ch3 = (ord(_) - ord('가'))-(588*ch1)-28*ch2\n",
" r_lst.append([str_start[ch1], str_mid[ch2], str_last[ch3]])\n",
" else: r_lst.append([_])\n",
" return r_lst\n",
"\n",
"str_korchar(\"된장찌개 볶음abc\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['깍', '두', '기']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(\"깍두기\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['ㄲ', 'ㅏ', 'ㄱ'], ['ㄷ', 'ㅜ', ' '], ['ㄱ', 'ㅣ', ' ']]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_korchar(\"깍두기\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(ord(\"깍\")-ord(\"가\"))//588"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"# **Requests 크롤링**\n",
"**[Pypi](https://pypi.org/project/requests/) :** ! pip install requests "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **1 Urllib 의 Parse 를 활용한 경로추출**\n",
"앞에서 작업했던 내용을 활용하여 경로명 만들기"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SplitResult(scheme='https', netloc='www.kamis.or.kr', path='/customer/price/retail/item.do', query='action=priceinfo', fragment='')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Base url 경로\n",
"url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n",
"\n",
"from urllib import parse\n",
"parse.urlsplit(url)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'®day=2019-10-18&itemcategorycode=100&itemcode=111&convert_kg_yn=N'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"params = {\n",
" \"regday\" : \"2019-10-18\",\n",
" \"itemcategorycode\" : \"100\",\n",
" \"itemcode\" : \"111\",\n",
" \"convert_kg_yn\" : \"N\",\n",
"}\n",
"query = '&' + parse.urlencode(params, encoding='UTF-8')\n",
"query"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo®day=2019-10-18&itemcategorycode=100&itemcode=111&convert_kg_yn=N'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url + query"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 구분 | \n",
" 구분.1 | \n",
" 구분.2 | \n",
" 당일 10/18 | \n",
" 1일전10/17 | \n",
" 2일전10/16 | \n",
" 3일전10/15 | \n",
" 4일전10/14 | \n",
" 7일전10/11 | \n",
" 1개월전 | \n",
" 1년전 | \n",
" 평년 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 평균 | \n",
" 평균 | \n",
" 평균 | \n",
" 51469 | \n",
" 51475 | \n",
" 51484 | \n",
" 51695 | \n",
" 51695 | \n",
" 51318 | \n",
" 51553 | \n",
" 53253 | \n",
" 45758 | \n",
"
\n",
" \n",
" 1 | \n",
" 최고값 | \n",
" 최고값 | \n",
" 최고값 | \n",
" 58000 | \n",
" 58000 | \n",
" 58000 | \n",
" 58000 | \n",
" 58000 | \n",
" 56000 | \n",
" 58000 | \n",
" 62500 | \n",
" 59600 | \n",
"
\n",
" \n",
" 2 | \n",
" 최저값 | \n",
" 최저값 | \n",
" 최저값 | \n",
" 44900 | \n",
" 44900 | \n",
" 44900 | \n",
" 45900 | \n",
" 45900 | \n",
" 45900 | \n",
" 47800 | \n",
" 49000 | \n",
" 33800 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 구분 구분.1 구분.2 당일 10/18 1일전10/17 2일전10/16 3일전10/15 4일전10/14 7일전10/11 1개월전 \\\n",
"0 평균 평균 평균 51469 51475 51484 51695 51695 51318 51553 \n",
"1 최고값 최고값 최고값 58000 58000 58000 58000 58000 56000 58000 \n",
"2 최저값 최저값 최저값 44900 44900 44900 45900 45900 45900 47800 \n",
"\n",
" 1년전 평년 \n",
"0 53253 45758 \n",
"1 62500 59600 \n",
"2 49000 33800 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.read_html(url + query)[3].head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 Requests 를 사용한 Get 크롤링**\n",
"1. Url 의 **pararm** 경로와, 가상 **web Header** 인코딩을 자동처리 합니다.\n",
"1. https://pypi.org/project/fake-useragent/"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);\n",
"# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)\n",
"# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n",
"# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13\n",
"# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1\n",
"# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1\n",
"# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11\n",
"# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\t\\t\\t농수산식품기업지원센터\\r\\n\\t\\t\\t\\t\\taT메인\\r\\n\\t\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\t<'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n",
"params = {\n",
" \"regday\" : \"2019-10-18\",\n",
" \"itemcategorycode\" : \"100\",\n",
" \"itemcode\" : \"111\",\n",
" \"convert_kg_yn\" : \"N\",\n",
"}\n",
"\n",
"import requests\n",
"web = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n",
"headers = {'User-Agent' : web}\n",
"resp = requests.get(url, params=params, headers=headers)\n",
"resp.text[2800:3000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 Requests 를 사용한 Post 크롤링**\n",
"1. Url 의 **pararm** 경로와, 가상 **web Header** 인코딩을 자동처리 합니다.\n",
"1. https://pypi.org/project/fake-useragent/"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'kR4BVNFag+szu8hB3hhNym4wnhHiECpalBD0QD5gSoY4/Mjy4rk9q1LU0QEBvNx3VL/nX4om315BLTjIX61DApKASSKfgGAp4Jqixd5FS6sKsv5LGpssDJ7Zl00V37JevrpGopSxpJATyEj4YPPUfOCnedA+9U5lnd06bVaLHpljzFThxW7yJKPGJ2/gVaY36o75kXbijHhx476kbDn1h7fJpvPTvq+bUW+PV6jAxw11fjqegl6QlsJCpllml/kOmriWaPY/1wrCgpqPf1MfAO0M5UAH9uwubXKtNjgkp5w='"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# OTP key 값 추출하기\n",
"# KRX 주가 데이터 수집예제\n",
"gen_otp_url = 'http://marketdata.krx.co.kr/contents/COM/GenerateOTP.jspx'\n",
"gen_otp_data = {\n",
" 'name':'fileDown', \n",
" 'filetype':'xls',\n",
" 'market_gubun':'ALL', #시장구분: ALL=전체\n",
" 'url':'MKD/04/0404/04040200/mkd04040200_01',\n",
" 'schdate':'20191018',\n",
" 'pagePath':'/contents/MKD/04/0404/04040200/MKD04040200.jsp'\n",
"}\n",
"r = requests.post(gen_otp_url, gen_otp_data)\n",
"OTP_code = r.content\n",
"OTP_code"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\t
\\xea\\xb0\\x80\\xea\\xb2\\xa9\\xec\\xa0\\x95\\xeb\\xb3\\xb4 - \\xec\\xb9\\x9c\\xed\\x99\\x98\\xea\\xb2\\xbd\\xeb\\x86\\x8d\\xec\\x82\\xb0\\xeb\\xac\\xbc - \\xed\\x92\\x88\\xeb\\xaa\\xa9\\xeb\\xb3\\x84 \\xec\\x97\\x91\\xec\\x85\\x80 \\xec\\xb6\\x9c\\xeb\\xa0\\xa5\\r\\n\\t
\\r\\n\\t\\r\\n\\r\\n\\r\\n\\t
\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\t\\r\\n\\t\\t\\t\\t\\t\\r\\n\\t\\t\\t\\t \\r\\n\\t\\t\\t | \\r\\n\\t\\t
\\r\\n\\t
\\r\\n\\r\\n'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = {\n",
" \"Accept\" :\"text/html,application/xhtml+xm…plication/xml;q=0.9,*/*;q=0.8\",\n",
" \"Accept-Encoding\" : \"gzip, deflate, br\",\n",
" \"Accept-Language\" : \"ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3\",\n",
" \"Connection\" : \"keep-alive\",\n",
" \"Content-Length\" : \"76940\",\n",
" \"Content-Type\" : \"application/x-www-form-urlencoded\",\n",
" \"Host\" : \"www.kamis.or.kr\",\n",
" \"Referer\" : url + query,\n",
" \"Upgrade-Insecure-Requests\" : \"1\",\n",
" \"User-Agent\" : \"Mozilla/5.0 (X11; Ubuntu; Linu…) Gecko/20100101 Firefox/69.0\"\n",
"}\n",
"resp = requests.post(\"https://www.kamis.or.kr/jsp/common/excel_util.jsp\", data=data)\n",
"resp.content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **3 KAMIS 데이터 크롤링**\n",
"Get 방식으로 기존의 lxml 모듈로 수집하기"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"
"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://www.kamis.or.kr/customer/price/retail/item.do?action=priceinfo\"\n",
"params = {\n",
" \"regday\" : \"2015-10-14\",\n",
" \"itemcategorycode\" : \"100\",\n",
" \"itemcode\" : \"111\",\n",
" \"convert_kg_yn\" : \"N\",\n",
"}\n",
"\n",
"import requests\n",
"web = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'\n",
"headers = {'User-Agent' : web}\n",
"resp = requests.get(url, params=params, headers=headers)\n",
"resp"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['45042', '50000', '38800']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from lxml.html import fromstring, tostring\n",
"resp_lxml = fromstring(resp.text)\n",
"resp_table = resp_lxml.xpath('//table[@id=\"itemTable_1\"]')[0]\n",
"\n",
"import pandas as pd\n",
"table = pd.read_html(tostring(resp_table))[0]\n",
"table_idx = [_ for _, txt in enumerate(list(table.columns)) if txt.find('당일') != -1]\n",
"\n",
"if len(table_idx) == 1:\n",
" table_idx = table_idx[0]\n",
" table = table.iloc[:,[0, table_idx]].set_index(\n",
" table.columns[0]).loc[[\"평균\",\"최고값\",\"최저값\"]]\n",
" table = list(table.to_dict(orient='list').values())[0]\n",
"else:\n",
" table = None\n",
"table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"# **API Key 를 활용한 크롤링**\n",
"\n",
"## **1 식품안전나라 안내 Page**\n",
"1. **[Open API 메인 페이지](https://www.foodsafetykorea.go.kr/api/userApiKey.do#)**, **[API 활용 방법](https://www.foodsafetykorea.go.kr/api/howToUseApi.do?menu_grp=MENU_GRP34&menu_no=687)**, **[API 이용절차](https://www.foodsafetykorea.go.kr/api/board/boardDetail.do)**\n",
"\n",
"상세페이지 바로가기"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 조리 레시피 API 안내 페이지 https://www.foodsafetykorea.go.kr/api/openApiInfo.do\n",
"query = {\n",
" \"menu_no\":\"849\",\n",
" \"menu_grp\":\"MENU_GRP31\",\n",
" \"start_idx\":\"120\",\n",
" \"svc_no\":\"COOKRCP01\",\n",
" \"svc_type_cd\":\"API_TYPE06\",\n",
" \"show_cnt\":\"10\",\n",
" \"Referer\":\"https://www.foodsafetykorea.go.kr/api/sheetInfo.do\",\n",
"}\n",
"import requests\n",
"resp = requests.post(\"https://www.foodsafetykorea.go.kr/api/openApiInfo.do\", data=query)\n",
"resp\n",
"# with open(\"food.html\", \"w\") as f:\n",
"# f.write(resp.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 API 크롤링**\n",
"1. **[Open API 메인 페이지](https://www.foodsafetykorea.go.kr/api/userApiKey.do#)**, **[API 활용 방법](https://www.foodsafetykorea.go.kr/api/howToUseApi.do?menu_grp=MENU_GRP34&menu_no=687)**, **[API 이용절차](https://www.foodsafetykorea.go.kr/api/board/boardDetail.do)**\n",
"1. key : \"8acba1823ae742359560\"\n",
"\n",
"**\"keyId\"** 부분에 **인증키를** 입력하고, **\"serviceId\"에는** 미리보기 주소에 있는 **serviceId** 를 사용\n",
"\n",
"ex) http://openapi.foodsafetykorea.go.kr/api/인증키/I0580/xml/1/20 \n",
"\n",
"1. HACCP 적용업소 지정현황 **'I0580'** 사용\n",
"1. **'dataType'** 에서 **Xml, Json** 중 타입을 한가지 입력\n",
"1. **'startIdx'** 에서 **시작 정보 행 순번** (1번부터 : '1')\n",
"1. **'endIdx'** 에서 **마지막 행 순번** (20번까지 : '20')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# http://openapi.foodsafetykorea.go.kr/api/keyId/serviceId/dataType/startIdx/endIdx\n",
"\n",
"url = \"http://openapi.foodsafetykorea.go.kr/api\" # sample/I2620/xml/1/5\"\n",
"keyId = \"8acba1823ae742359560\"\n",
"serviceId = \"I0750\" # \"I0580\" # I0750\n",
"dataType = \"json\"\n",
"startNum = \"1\"\n",
"endNum = \"3\"\n",
"url = \"/\".join([url, keyId, serviceId, dataType, startNum, endNum])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.15 ms, sys: 402 µs, total: 7.55 ms\n",
"Wall time: 22.7 s\n"
]
},
{
"data": {
"text/plain": [
"{'I0750': {'RESULT': {'MSG': '정상처리되었습니다.', 'CODE': 'INFO-000'},\n",
" 'total_count': '13824',\n",
" 'row': [{'NUTR_CONT3': '10.1',\n",
" 'NUTR_CONT2': '67.8',\n",
" 'NUTR_CONT1': '349',\n",
" 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n",
" 'BGN_YEAR': '2001',\n",
" 'NUTR_CONT9': 'N/A',\n",
" 'NUTR_CONT8': 'N/A',\n",
" 'FOOD_CD': '100101000100000001',\n",
" 'NUTR_CONT7': 'N/A',\n",
" 'NUTR_CONT6': 'N/A',\n",
" 'NUTR_CONT5': 'N/A',\n",
" 'NUTR_CONT4': '3.7',\n",
" 'DESC_KOR': '고량미,알곡',\n",
" 'SERVING_WT': '100',\n",
" 'FDGRP_NM': '곡류 및 그 제품',\n",
" 'NUM': '1',\n",
" 'ANIMAL_PLANT': ''},\n",
" {'NUTR_CONT3': '11.4',\n",
" 'NUTR_CONT2': '73.5',\n",
" 'NUTR_CONT1': '332',\n",
" 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n",
" 'BGN_YEAR': '2017',\n",
" 'NUTR_CONT9': 'N/A',\n",
" 'NUTR_CONT8': 'N/A',\n",
" 'FOOD_CD': '100101000200200001',\n",
" 'NUTR_CONT7': 'N/A',\n",
" 'NUTR_CONT6': '2',\n",
" 'NUTR_CONT5': 'N/A',\n",
" 'NUTR_CONT4': '3.7',\n",
" 'DESC_KOR': '겉귀리,생것',\n",
" 'SERVING_WT': '100',\n",
" 'FDGRP_NM': '곡류 및 그 제품',\n",
" 'NUM': '2',\n",
" 'ANIMAL_PLANT': ''},\n",
" {'NUTR_CONT3': '14.3',\n",
" 'NUTR_CONT2': '70.4',\n",
" 'NUTR_CONT1': '334',\n",
" 'FOOD_GROUP': '농촌진흥청 식품성분표 DB',\n",
" 'BGN_YEAR': '2017',\n",
" 'NUTR_CONT9': 'N/A',\n",
" 'NUTR_CONT8': 'N/A',\n",
" 'FOOD_CD': '100101000200300001',\n",
" 'NUTR_CONT7': 'N/A',\n",
" 'NUTR_CONT6': '3',\n",
" 'NUTR_CONT5': 'N/A',\n",
" 'NUTR_CONT4': '3.8',\n",
" 'DESC_KOR': '쌀귀리,생것',\n",
" 'SERVING_WT': '100',\n",
" 'FDGRP_NM': '곡류 및 그 제품',\n",
" 'NUM': '3',\n",
" 'ANIMAL_PLANT': ''}]}}"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"from urllib import request, parse\n",
"resp = request.urlopen(url)\n",
"resp = parse.quote_plus(resp.read()) # Byte to String\n",
"resp = parse.unquote_plus(resp) # Encoding Text\n",
"\n",
"import json\n",
"resp = json.loads(resp) # Json 데이터를 Dict 으로 변환\n",
"resp\n",
"# len(resp['I0750']['row'])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'info-000'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resp['I0750']['RESULT']['CODE'].lower()#.find(\"error\")"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resp['I0750']['RESULT']['CODE'].lower().find(\"error\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"col_to_kor = {\n",
" \"NUM\":\"번호\",\n",
" \"FOOD_CD\":\"식품코드\",\n",
" \"FDGRP_NM\":\"식품군\",\n",
" \"DESC_KOR\":\"식품이름\",\n",
" \"SERVING_WT\":\"1회제공량(g)\",\n",
" \"NUTR_CONT1\":\"열량(kcal)(1회제공량당)\",\n",
" \"NUTR_CONT2\":\"탄수화물(g)(1회제공량당)\",\n",
" \"NUTR_CONT3\":\"단백질(g)(1회제공량당)\",\n",
" \"NUTR_CONT4\":\"지방(g)(1회제공량당)\",\n",
" \"NUTR_CONT5\":\"당류(g)(1회제공량당)\",\n",
" \"NUTR_CONT6\":\"나트륨(mg)(1회제공량당)\",\n",
" \"NUTR_CONT7\":\"콜레스테롤(mg)(1회제공량당)\",\n",
" \"NUTR_CONT8\":\"포화지방산(g)(1회제공량당)\",\n",
" \"NUTR_CONT9\":\"트랜스지방(g)(1회제공량당)\",\n",
" \"ANIMAL_PLANT\":\"가공업체명\",\n",
" \"BGN_YEAR\":\"구축년도\",\n",
" \"FOOD_GROUP\":\"자료원\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 번호 | \n",
" 식품코드 | \n",
" 식품군 | \n",
" 식품이름 | \n",
" 1회제공량(g) | \n",
" 열량(kcal)(1회제공량당) | \n",
" 탄수화물(g)(1회제공량당) | \n",
" 단백질(g)(1회제공량당) | \n",
" 지방(g)(1회제공량당) | \n",
" 당류(g)(1회제공량당) | \n",
" 나트륨(mg)(1회제공량당) | \n",
" 콜레스테롤(mg)(1회제공량당) | \n",
" 포화지방산(g)(1회제공량당) | \n",
" 트랜스지방(g)(1회제공량당) | \n",
" 가공업체명 | \n",
" 구축년도 | \n",
" 자료원 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 100101000100000001 | \n",
" 곡류 및 그 제품 | \n",
" 고량미,알곡 | \n",
" 100 | \n",
" 349 | \n",
" 67.8 | \n",
" 10.1 | \n",
" 3.7 | \n",
" N/A | \n",
" N/A | \n",
" N/A | \n",
" N/A | \n",
" N/A | \n",
" | \n",
" 2001 | \n",
" 농촌진흥청 식품성분표 DB | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 100101000200200001 | \n",
" 곡류 및 그 제품 | \n",
" 겉귀리,생것 | \n",
" 100 | \n",
" 332 | \n",
" 73.5 | \n",
" 11.4 | \n",
" 3.7 | \n",
" N/A | \n",
" 2 | \n",
" N/A | \n",
" N/A | \n",
" N/A | \n",
" | \n",
" 2017 | \n",
" 농촌진흥청 식품성분표 DB | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 100101000200300001 | \n",
" 곡류 및 그 제품 | \n",
" 쌀귀리,생것 | \n",
" 100 | \n",
" 334 | \n",
" 70.4 | \n",
" 14.3 | \n",
" 3.8 | \n",
" N/A | \n",
" 3 | \n",
" N/A | \n",
" N/A | \n",
" N/A | \n",
" | \n",
" 2017 | \n",
" 농촌진흥청 식품성분표 DB | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 번호 식품코드 식품군 식품이름 1회제공량(g) 열량(kcal)(1회제공량당) \\\n",
"0 1 100101000100000001 곡류 및 그 제품 고량미,알곡 100 349 \n",
"1 2 100101000200200001 곡류 및 그 제품 겉귀리,생것 100 332 \n",
"2 3 100101000200300001 곡류 및 그 제품 쌀귀리,생것 100 334 \n",
"\n",
" 탄수화물(g)(1회제공량당) 단백질(g)(1회제공량당) 지방(g)(1회제공량당) 당류(g)(1회제공량당) 나트륨(mg)(1회제공량당) \\\n",
"0 67.8 10.1 3.7 N/A N/A \n",
"1 73.5 11.4 3.7 N/A 2 \n",
"2 70.4 14.3 3.8 N/A 3 \n",
"\n",
" 콜레스테롤(mg)(1회제공량당) 포화지방산(g)(1회제공량당) 트랜스지방(g)(1회제공량당) 가공업체명 구축년도 \\\n",
"0 N/A N/A N/A 2001 \n",
"1 N/A N/A N/A 2017 \n",
"2 N/A N/A N/A 2017 \n",
"\n",
" 자료원 \n",
"0 농촌진흥청 식품성분표 DB \n",
"1 농촌진흥청 식품성분표 DB \n",
"2 농촌진흥청 식품성분표 DB "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.DataFrame(resp['I0750']['row'])\n",
"df = df.loc[:,list(col_to_kor.keys())] # header 정렬\n",
"df.columns = [col_to_kor[_] for _ in list(df.columns)] \n",
"# df.to_excel('data/food_nutrient.xls', index=False)\n",
"df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}