{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **식품안전나라**\n", "ex) http://openapi.foodsafetykorea.go.kr/api/인증키/I0580/xml/1/20 \n", "\n", "1. **[Open API 메인 페이지](https://www.foodsafetykorea.go.kr/api/userApiKey.do#)**, **[API 활용 방법](https://www.foodsafetykorea.go.kr/api/howToUseApi.do?menu_grp=MENU_GRP34&menu_no=687)**, **[API 이용절차](https://www.foodsafetykorea.go.kr/api/board/boardDetail.do)**\n", "1. **[영양사협회 영양소 섭취기준표](http://www.kns.or.kr/FileRoom/FileRoom_view.asp?mode=mod&restring=%252FFileRoom%252FFileRoom.asp%253Fxsearch%253D0%253D%253Dxrow%253D10%253D%253DBoardID%253DKdr%253D%253Dpage%253D1&idx=79&page=1&BoardID=Kdr&xsearch=1&cn_search)**\n", "1. key : \"8acba1823ae742359560\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 함수로 만든내용 확인**\n", "식품의약처 API 내용 활용함수 : **Json / DataFrame** 으로 출력\n", "1. **'C005' :** '바코드제품정보'\n", "1. **'I2570' :** '유통바코드'\n", "1. **'I0490' :** '회수판매중지'\n", "1. **'I0750' :** '식품영양정보DB'\n", "1. **'COOKRCP01' :** '조리식품_레시피_DB'" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'momukji'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mpandas\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtqdm\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mmomukji\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mFoodSafe\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mFoodSafe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapiKey\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0m_\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'name'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mFoodSafe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapiKey\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkeys\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'momukji'" ] } ], "source": [ "import pandas as pd\n", "from tqdm import tqdm\n", "from momukji import FoodSafe\n", "[(_, FoodSafe().apiKey[_]['name']) for _ in FoodSafe().apiKey.keys()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "result, foodId = [], 'I0490' # 회수요청제품 (400개)\n", "data = FoodSafe().getData(foodId, 1, 1000, FoodSafe().apiKey[foodId]['cols'], display=True)\n", "data.to_excel(\"data/food_recall.xls\", index=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "result, foodId = [], 'I0750' # 식품 영양정보\n", "_ = FoodSafe().getData(foodId, 1, 2, FoodSafe().apiKey[foodId]['cols'], display=True)\n", "# for _ in tqdm(range(1, 13824+1, 1000)):\n", "# result.append(FoodSafe().getData(foodId, _, _+999, FoodSafe().apiKey[foodId]['cols'])) \n", "# pd.concat(result).to_excel('data/food_nutrient.xls', index=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "result, foodId = [], 'COOKRCP01' # 레시피 데이터\n", "_ = FoodSafe().getData(foodId, 1, 2, FoodSafe().apiKey[foodId]['cols'], display=True)\n", "# for _ in tqdm(range(1, 1500+1, 1000)):\n", "# result.append(FoodSafe().getData(foodId, _, _+999, FoodSafe().apiKey[foodId]['cols']))\n", "# pd.concat(result).to_excel('data/food_recipe_info.xls', index=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "result, foodId = [], 'C005' # 제품 바코드 번호\n", "# _ = FoodSafe().getData(foodId, 1, 2, FoodSafe().apiKey[foodId]['cols'], display=True)\n", "# for _ in tqdm(range(1, 100200+1, 1000)):\n", "from collections import OrderedDict\n", "for _ in tqdm(range(1, 100200+1, 1000)):\n", " result.append(FoodSafe().getData(foodId, _, _+999, \n", " FoodSafe().apiKey[foodId]['cols']).loc[:,\n", " list(OrderedDict(FoodSafe().apiKey['C005']['cols']).values())])\n", "pd.concat(result).to_csv('data/food_barcode.csv', index=None)\n", "pd.concat(result).to_excel('data/food_barcode.xls', index=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "result, foodId = [], \"I2570\" # 유통 바코드\n", "_ = FoodSafe().getData(foodId, 1, 2, FoodSafe().apiKey[foodId]['cols'], display=True)\n", "# for _ in tqdm(range(1, 49000+1, 1000)):\n", "# result.append(FoodSafe().getData(foodId, _, _+999, FoodSafe().apiKey[foodId]['cols']))\n", "# pd.concat(result).to_excel('data/food_barcode_info.xls', index=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **Kamis 식재료 정보수집**\n", "**[Open API 관리 페이지](https://www.kamis.or.kr/customer/mypage/my_openapi/my_openapi.do)**\n", "1. **[Regex Html Crawling](https://stackoverflow.com/questions/13823479/finding-all-tr-from-html-table-in-python)**\n", "1. key : \"fb54dcd7-218f-4297-8fa6-31cfcd0a897d\"\n", "\n", "```\n", "'(.+?)' # 특정태그\n", "\"<[^>]+>|[^<]+\" # html 태그 내부의 한글 추출\n", "'<.*?>' # 모든 태그\n", "reg_table = '(.+?)'\n", "reg_table = '(.+?)'\n", "reg_table = '<.*?>'\n", "reg_table = \"(.*?)\"\n", "reg_table = '(.*?)
'\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 Web 페이지 크롤링 함수 확인**\n", "Kamis 식품정보 수집" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from momukji import Kamis\n", "Kamis().getData('2019-10-15', \"100\").head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from momukji import Kamis\n", "Kamis().getData('2019-10-15', \"100\" ,types=2).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 API Json 호출**\n", "1. **[Kamis OpenAPI Key 매뉴얼 페이지](https://www.kamis.or.kr/customer/reference/openapi_list.do)**\n", "1. **[Kamis Code 상세 페이지](https://www.kamis.or.kr/customer/reference/openapi_list.do?action=detail&boardno=5)** | **[Code 다운로드](http://www.kamis.or.kr/customer/board/board_file.do?brdno=4&brdctsno=424245&brdctsfileno=12212)**\n", "1. **[Kamis OpenAPI](https://www.kamis.or.kr/customer/reference/openapi_list.do?action=detail&boardno=8)** 식품정보 수집" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from momukji import Kamis\n", "resp = Kamis().getApi('2019-10-21', 500, cd=2)\n", "resp.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **서울시 농수산 식품공사 API**\n", "1. https://www.garakprice.com/index.php?go_date=20191001 가락동 농산물 시세 Web\n", "1. http://www.garak.co.kr/gongsa/jsp/gs/intro/common.jsp JSP 호출주소\n", "\n", "**[서울시 농산물 공사 OpenAPI](https://www.garak.co.kr/publicdata/selectPageListPublicData.do?sch_public_data_realm_code=1)**\n", "\n", "```html\n", "http://www.garak.co.kr/gongsa/jsp/gs/data_open/data.jsp?id=2087&passwd=1004&dataid=data4&pagesize=10\n", "&pageidx=1&portal.templet=false&p_ymd=20190408&p_jymd=20190408&d_cd=2&p_jjymd=20130429\n", "&p_pos_gubun=1&pum_nm=\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 API 를 활용한 수집**\n", "가격정보를 **Xml API** 를 활용하여 수집" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 문제점...\n", "# Page 를 순환하면서 1개의 객체만 있으면 list 갯수가 1개로만 출력\n", "\n", "# 해결방법... (https://rfriend.tistory.com/482)\n", "# 1개의 dict 을 DataFrame 변환 시, 데이터를 list로 변경해서 적용\n", "# 그래야 컬럼, 데이터와 \"인덱스\" 를 자동으로 계산할 수 있다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "date_info = '20191022'\n", "\n", "import pandas as pd\n", "from momukji import Garak\n", "xml_d_1 = Garak().getData(cd=1) # date_info, \n", "xml_d_2 = Garak().getData(cd=2) # date_info, \n", "xml_data = pd.concat([xml_d_1, xml_d_2]).reset_index(drop=True)\n", "xml_data.to_excel(\"data/food_Garak.xls\", index=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Web 페이지를 활용한 수집기**\n", "https://www.garakprice.com/index.php?go_date=20191021\n", "1. 하지만 품목수가 부족해서 실익이 적음" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# https://www.garakprice.com/index.php?go_date=20191021\n", "from urllib import parse, request\n", "def urllib_request(url, params, encode='utf-8'):\n", " params = parse.urlencode(params).encode(encode)\n", " url = request.Request(url, params)\n", " resp = request.data/lopen(url).read()\n", " resp = parse.quote_plus(resp)\n", " return parse.unquote_plus(resp)\n", "\n", "# params = { \"go_date\":20191021 }\n", "# url = \"https://www.garakprice.com/index.php\"\n", "# urllib_request(url, params)[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **기타 공산품 가격정보**\n", "\n", "**[Data.go.kr 공산품 가격정보](https://www.data.go.kr/dataset/3043385/openapi.do?lang=ko)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 공산품 Id 정보수집**\n", "**[Data.go.kr 공산품 가격정보](https://www.data.go.kr/dataset/3043385/openapi.do?lang=ko)**\n", "1. **[한국소비자원 참가격](https://www.price.go.kr/tprice/portal/pricenewsandtpriceintro/iteminfo/getItemList.do)**\n", "1. **[유통상품지식뱅크](http://35.200.32.201/)**\n", "1. 분류기준 명확하게 찾기" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from momukji import Product\n", "item_list = Product().getList()\n", "item_list.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"http://ksetup.com/bbs/page.php?hid=code\"\n", "\n", "import re\n", "import pandas as pd\n", "from urllib import request, parse\n", "resp = request.urlopen(url).read()\n", "resp = parse.quote_plus(resp)\n", "resp = parse.unquote_plus(resp)\n", "table = re.findall(r'(.*?)
', resp, re.M|re.I|re.S)\n", "table = \"\" + table[0] + \"
\"\n", "table = pd.read_html(table)[0]\n", "# table.to_excel('company_code.xls', index=None, columns=None)\n", "table.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 공산품 인터넷 쇼핑몰 정보**\n", "1. 네이버 쇼핑/ 핫딜\n", "1. 다음쇼핑 등 정리한 내용 재정의 및 복습\n", "1. 유통정보센터 식품관련 데이터 수집 및 정리" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **급식메뉴 안내**\n", "**[Data.go.kr 공산품 가격정보](https://www.data.go.kr/dataset/3043385/openapi.do?lang=ko)**\n", "\n", "**[NEIS 급식메뉴 조회 수집 사이트](https://stu.gen.go.kr/sts_sci_md00_001.do?schulCode=F100000120&schulCrseScCode=4&schulKndScCode=04&ay=2019&mm=10)**\n", "```\n", "https://stu.gen.go.kr/sts_sci_md00_001.do?schulCode=F100000120&schulCrseScCode=4&schulKndScCode=04&ay=2019&mm=12\n", "\n", "https://stu.gen.go.kr/sts_sci_md00_001.do?\n", " schulCode=F100000120\n", " &schulCrseScCode=4\n", " &schulKndScCode=04\n", " &ay=2019\n", " &mm=10\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"{:02d}\".format(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = {\n", " \"schulCode\":\"F100000120\",\n", " \"schulCrseScCode\":4,\n", " \"schulKndScCode\":\"04\",\n", " \"ay\":2019, # 년도\n", " \"mm\":10 # 월\n", "}\n", "from urllib import parse, request\n", "url = \"https://stu.gen.go.kr/sts_sci_md00_001.do?\" + parse.urlencode(query)\n", "resp = request.urlopen(url).read()\n", "resp = parse.quote_plus(resp)\n", "resp = parse.unquote_plus(resp)\n", "resp[:200]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "re.findall()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ ", \"/sts_sci_md00_001.do?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from menu_parser import MenuParser\n", "from school import School\n", "school = School(School.Region.GWANGJU, School.Type.HIGH, \"F100000120\")\n", "parser = MenuParser(school)\n", "menus = parser.get_menu()\n", "print(menus.today)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class School:\n", " class Region:\n", " BUSAN = \"stu.pen.go.kr\"\n", " CHUNGBUK = \"stu.cbe.go.kr\"\n", " CHUNGNAM = \"stu.cne.go.kr\"\n", " DAEJEON = \"stu.dge.go.kr\"\n", " DEAGU = \"stu.dge.go.kr\"\n", " GWANGJU = \"stu.gen.go.kr\"\n", " GYEONGBUK = \"stu.gbe.go.kr\"\n", " GYEONGGI = \"stu.goe.go.kr\"\n", " GYEONGNAM = \"stu.gne.go.kr\"\n", " INCHEON = \"stu.ice.go.kr\"\n", " JEJU = \"stu.jje.go.kr\"\n", " JEONBUK = \"stu.jbe.go.kr\"\n", " JEONNAM = \"stu.jne.go.kr\"\n", " KANGWON = \"stu.kwe.go.kr\"\n", " SEJONG = \"stu.sje.go.kr\"\n", " SEOUL = \"stu.sen.go.kr\"\n", " ULSAN = \"stu.use.go.kr\"\n", "\n", " class Type:\n", " KINDERGARTEN = 1\n", " ELEMENTARY = 2\n", " MIDDLE = 3\n", " HIGH = 4\n", "\n", " def __init__(self, school_region, school_type, school_code):\n", " self.region = school_region\n", " self.type = school_type\n", " self.code = school_code\n", "\n", "school = School(School.Region.GWANGJU, School.Type.HIGH, \"F100000120\")\n", "school" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MenuParser:\n", " \n", " def __init__(self, school):\n", " self.school = school\n", "\n", " def get_menu(self, year=None, month=None):\n", " \"\"\"\n", " 해당 학교로부터 급식을 가져온다.\n", " year와 month가 모두 주어졌다면 해당하는 정보를 가져온다.\n", " 주어지지 않았을 때에는 자동으로 가져오게 된다.\n", " \"\"\"\n", " if year is None or month is None:\n", " today = datetime.date.today()\n", " url = self.__create_url(today.year, today.month)\n", " else:\n", " url = self.__create_url(year, month)\n", " page = self.__get_page(url); print(url)\n", " soup = BeautifulSoup(page, \"html.parser\")\n", " items = soup.select(\"#contents > div > table > tbody > tr > td > div\")\n", " return Menu(items)\n", "\n", " def __get_page(self, url):\n", " try:\n", " page = requests.get(url).text \n", " except Exception as e:\n", " logging.error(e)\n", " return page\n", "\n", " def __create_url(self, year, month):\n", " today = datetime.date(year, month, 1)\n", " url = f\"https://{self.school.region}/sts_sci_md00_001.do?\"\n", " url += f\"schulCode={self.school.code}&\"\n", " url += f\"schulCrseScCode={self.school.type}&\"\n", " url += f\"schulKndScCode={self.school.type:02d}&\"\n", " url += f\"ay={today.year}&\"\n", " url += f\"mm={today.month:02d}\"\n", " print(url)\n", " return url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parser = MenuParser(school)\n", "parser" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def __create_url(self, year, month):\n", " today = datetime.date(year, month, 1)\n", " url = f\"https://{self.school.region}/sts_sci_md00_001.do?\"\n", " url += f\"schulCode={self.school.code}&\"\n", " url += f\"schulCrseScCode={self.school.type}&\"\n", " url += f\"schulKndScCode={self.school.type:02d}&\"\n", " url += f\"ay={today.year}&\"\n", " url += f\"mm={today.month:02d}\"\n", " print(url)\n", " return url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 교육청 코드\n", "school.region" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 학교종류\n", "school.type" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 개별 학교코드\n", "school.code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"schulKndScCode={:02d}\".format(4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parser = MenuParser(school)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parser." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **일일수집 해외 위해정보**\n", "http://www.foodsafetykorea.go.kr/riskinfo/board-collect-list.do\n", "1. Selenium 으로 해당 이벤트를 클릭한 뒤\n", "1. 상세페이지에서 해당 xls 파일 다운로드 받기" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = datetime.today()\n", "temp.strftime('%Y%m%d')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datetime.strftime(temp, '%Y%m%d')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 알라딘 중고책**\n", "api 크롤링을 사용한 데이터 수집" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = {\n", " \"KeyWord\":\"파이썬\", #\"%C6%C4%C0%CC%BD%E3\",\n", " \"ViewType\":\"Detail\",\n", " \"SortOrder\":5, # 5:출시일순, 11:등록순\n", " \"ViewRowCount\":50,\n", " \"page\":2, \n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://www.aladin.co.kr/search/wsearchresult.aspx?SearchTarget=UsedStore\"\n", "from urllib import parse, request\n", "params = parse.urlencode(query).encode('cp949') # Params 인코딩 변환\n", "resp = request.Request(url, params) # url 주소값 생성\n", "resp = request.urlopen(resp).read() # url 을 활용한 response 수집\n", "resp = parse.quote_plus(resp) # response 의 byte 를 string 변환\n", "resp = parse.unquote_plus(resp, encoding='cp949') # string 인코딩 변환 (default='utf-8')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('book.html','w') as f:\n", " f.write(resp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# decoding (기본값 utf-8)\n", "parse.unquote_plus('%C4%F6%C6%AE', encoding='cp949')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# encodings\n", "parse.quote_plus(\"퀀트\", encoding='cp949')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#
\n", "#