{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get BTC data (FGI, GST, BHP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For my project to predict BTC price, i had to get data from different sources with different methods (using package, webscraping and API).
\n", "So here are the lines of code i'm using to **get a CSV file** with :\n", "- Crypto Fear & Greed Index (FGI)\n", "- Bitcoin Google Searches Trend (GST)\n", "- Bitcoin Historical Price (BHP)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Librairies Importation" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Main\n", "import pandas as pd\n", "import numpy as np\n", "import math\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "# Web\n", "import requests\n", "from bs4 import BeautifulSoup\n", "#from IPython.display import display_html\n", "\n", "# More\n", "from pandas.plotting import register_matplotlib_converters\n", "register_matplotlib_converters()\n", "pd.set_option('display.max_rows', 400)\n", "\n", "# Date format (ex: 190913)\n", "from datetime import date\n", "#today = date.today().strftime(\"%Y%m%d\")[2:]\n", "\n", "# PyTrends\n", "from pytrends.request import TrendReq\n", "pytrend = TrendReq()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe Preparation" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Dataframe 1: Crypto fear & greed index (fgi)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Since February 2018, [Alternative.me](https://alternative.me/crypto/fear-and-greed-index/) provides a fear and greed index based on:
\n", "- Volatility (25 %)\n", "- Market Momentum/Volume (25%)\n", "- Social Media (15%)\n", "- Surveys (15%)\n", "- BTC Dominance (10%)\n", "- Trends (10%)\n", "
\n", "\n", ">Note: The API returns a CSV code, but i didn't find how to easily read/convert it.
\n", ">PS: There are some duplicates and missing values in these data. See below in 1.2.4
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ScoreCategory
Date
2018-02-0130Fear
2018-02-0215Extreme Fear
2018-02-0340Fear
2018-02-0424Extreme Fear
2018-02-0511Extreme Fear
\n", "
" ], "text/plain": [ " Score Category\n", "Date \n", "2018-02-01 30 Fear\n", "2018-02-02 15 Extreme Fear\n", "2018-02-03 40 Fear\n", "2018-02-04 24 Extreme Fear\n", "2018-02-05 11 Extreme Fear" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://api.alternative.me/fng/?limit=0&format=csv&date_format=cn'\n", "r = requests.get(url)\n", "fgi = BeautifulSoup(r.text, 'lxml')\n", "\n", "fgi = str(fgi)\n", "fgi = fgi[fgi.find('date\\n')+5:fgi.find('\\n\\t]')] \n", "fgi = fgi.replace('\\n',',').split(',')\n", "\n", "fgi = pd.DataFrame({'Date':fgi[::3],'Score':fgi[1:][::3],'Category':fgi[2:][::3]})\n", "fgi['Date'] = pd.to_datetime(fgi['Date'])\n", "fgi = fgi.set_index('Date').sort_index()\n", "\n", "fgi.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataframe 2: Bitcoin Google searches trend (gst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a recent notebook, i discover **Pytrends** (https://github.com/GeneralMills/pytrends) which is a great tool for web scraping Google Trends.
\n", "(Thanks to its team to share it)\n", "\n", "> You can get data from 2004. But note that if you choose a **too large period (~8 months)**, Google Trends returns weekly score, not daily.
\n", "\n", "The period from today (2019-09-13) to first Fear&Greed Index data (2018-02-01) exceeds this limit to get daily data. So the function ```interest_over_time()``` returns weekly data. But here is the magic and what we interest in. Pytrends gives a function `build_payload()` in order to set timeframe, localisation, etc.. (more on the [documentation](https://github.com/GeneralMills/pytrends)).
\n", "
(More: PyTrends has a function ```get_historical_interest()``` to get hours data, and many others possibilities.)

\n", "**What we need to do**, it to set period time of 8 months in order to have our daily trend score." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bitcoin tradingisPartial
date
2018-02-0182False
2018-02-0298False
2018-02-0364False
2018-02-0467False
2018-02-0583False
\n", "
" ], "text/plain": [ " bitcoin trading isPartial\n", "date \n", "2018-02-01 82 False\n", "2018-02-02 98 False\n", "2018-02-03 64 False\n", "2018-02-04 67 False\n", "2018-02-05 83 False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keywords list (max 5 words/expressions)\n", "kw_list = ['bitcoin trading']\n", "\n", "# 2018-02-01 to 2018-05-11\n", "pytrend.build_payload(kw_list, cat=0, timeframe='2018-02-01 2018-05-11', geo='', gprop='')\n", "gst_1 = pytrend.interest_over_time()\n", "\n", "# 2018-05-12 to 2019-01-22\n", "pytrend.build_payload(kw_list, cat=0, timeframe='2018-05-12 2019-01-22', geo='', gprop='')\n", "gst_2 = pytrend.interest_over_time()\n", "\n", "# 2019-01-23 to 2019-09-23\n", "pytrend.build_payload(kw_list, cat=0, timeframe='2019-01-23 2019-09-23', geo='', gprop='')\n", "gst_3 = pytrend.interest_over_time()\n", "\n", "# Concat\n", "frames = [gst_1, gst_2, gst_3]\n", "gst = pd.concat(frames)\n", "\n", "gst.head()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Dataframe 3: Bitcoin historical price (bhp)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "No surprise here, i already webscrap a few times the BTC price on [coinmarketcap.com](https://coinmarketcap.com/).
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true }, "outputs": [], "source": [ "def get_btc_price(start_date='20170101', end_date='20190913'):\n", " ''' \n", " Webscraping BTC price on coinmarketcap.com\n", " ''' \n", " url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'\n", " btc = BeautifulSoup(requests.get(url).text, 'lxml')\n", " btc = pd.read_html(str(btc.find_all('table', class_='table')[0]))[0]\n", " btc['Date'] = pd.to_datetime(btc['Date'])\n", " \n", " return btc" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Open*HighLowClose**VolumeMarket Cap
Date
2018-02-0110237.3010288.808812.289170.549959400448154428564694
2018-02-029142.289142.287796.498830.7512726899712148725283812
2018-02-038852.129430.758251.639174.917263790080154540000411
2018-02-049175.709334.878031.228277.017073549824139433682759
2018-02-058270.548364.846756.686955.279285289984117184385122
\n", "
" ], "text/plain": [ " Open* High Low Close** Volume Market Cap\n", "Date \n", "2018-02-01 10237.30 10288.80 8812.28 9170.54 9959400448 154428564694\n", "2018-02-02 9142.28 9142.28 7796.49 8830.75 12726899712 148725283812\n", "2018-02-03 8852.12 9430.75 8251.63 9174.91 7263790080 154540000411\n", "2018-02-04 9175.70 9334.87 8031.22 8277.01 7073549824 139433682759\n", "2018-02-05 8270.54 8364.84 6756.68 6955.27 9285289984 117184385122" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bhp = get_btc_price(start_date='20180201', end_date='20190923')\n", "bhp = bhp.set_index('Date').sort_index()\n", "\n", "bhp.head()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Dataframe 1 (fgi) + Dataframe 2 (gst) + Dataframe 3 (bhp)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Let's join our 3 dataframes. For each, we need to check the period, and the shape.
\n", "A necessary step, because we will find duplicates and missing values in one dataframe." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ScoreCategorybitcoin tradingisPartialOpen*HighLowClose**VolumeMarket Cap
Date
2018-02-0130Fear82False10237.3010288.808812.289170.549959400448154428564694
2018-02-0215Extreme Fear98False9142.289142.287796.498830.7512726899712148725283812
2018-02-0340Fear64False8852.129430.758251.639174.917263790080154540000411
2018-02-0424Extreme Fear67False9175.709334.878031.228277.017073549824139433682759
2018-02-0511Extreme Fear83False8270.548364.846756.686955.279285289984117184385122
\n", "
" ], "text/plain": [ " Score Category bitcoin trading isPartial Open* \\\n", "Date \n", "2018-02-01 30 Fear 82 False 10237.30 \n", "2018-02-02 15 Extreme Fear 98 False 9142.28 \n", "2018-02-03 40 Fear 64 False 8852.12 \n", "2018-02-04 24 Extreme Fear 67 False 9175.70 \n", "2018-02-05 11 Extreme Fear 83 False 8270.54 \n", "\n", " High Low Close** Volume Market Cap \n", "Date \n", "2018-02-01 10288.80 8812.28 9170.54 9959400448 154428564694 \n", "2018-02-02 9142.28 7796.49 8830.75 12726899712 148725283812 \n", "2018-02-03 9430.75 8251.63 9174.91 7263790080 154540000411 \n", "2018-02-04 9334.87 8031.22 8277.01 7073549824 139433682759 \n", "2018-02-05 8364.84 6756.68 6955.27 9285289984 117184385122 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_fgi = fgi.copy()\n", "df_gst = gst.copy()\n", "df_bhp = bhp.copy()\n", "\n", "#print(df_fgi.index[-1]) >>> 2019-09-23 (+3 days)\n", "#print(df_gst.index[-1]) >>> 2019-09-20\n", "#print(df_bhp.index[-1]) >>> 2019-09-22 (+2 day)\n", "\n", "df_fgi = df_fgi[:-3]\n", "df_gst = df_gst\n", "df_bhp = df_bhp[:-2]\n", "\n", "#print(df_fgi.shape) >>> (596, 2) (Something weird with this dataframe)\n", "#print(df_gst.shape) >>> (597, 2)\n", "#print(df_bhp.shape) >>> (597, 6)\n", "\n", "#df_fgi.reset_index().shape >>> (596, 3)\n", "#df_fgi.reset_index().drop_duplicates(subset='Date', keep='last').shape >>> (594, 3)\n", "\n", "# Found two duplicate values: 2019-02-17, 2019-01-30\n", "df_fgi = df_fgi.reset_index().drop_duplicates(subset='Date', keep='last')\n", "\n", "# Found three missing values: 2018-04-14, 2018-04-15, 2018-04-16\n", "# I had to create realistic values\n", "# And reported the bug to API team, maybe it will be fix\n", "val = pd.DataFrame({'Date':['2018-04-14', '2018-04-15', '2018-04-16'],'Score':['25', '26', '24'],'Category':['Extreme Fear', 'Fear', 'Extreme Fear']})\n", "val['Date'] = pd.to_datetime(val['Date'])\n", "df_fgi = df_fgi.append(val).set_index('Date').sort_index()\n", "\n", "# Okay\n", "df = df_fgi.join(df_gst).join(df_bhp)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Export dataframe as CSV file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Let's save this code (date format 190913)\n", "today = date.today().strftime(\"%Y%m%d\")[2:]\n", "\n", "df.to_csv(f'{today}_bitcoin_predictions_data.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }