{ "cells": [ { "cell_type": "markdown", "id": "95f0a171", "metadata": {}, "source": [ "(dates-and-times)=\n", "# Dates and Times\n", "\n", "## Introduction\n", "\n", "This chapter will show you how to work with dates and times in Python. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:\n", "\n", "- Does every year have 365 days?\n", "- Does every day have 24 hours?\n", "- Does every minute have 60 seconds?\n", "\n", "I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?\n", "\n", "You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25. You might not have known that some minutes have 61 seconds because every now and then leap seconds are added because the Earth's rotation is gradually slowing down.\n", "\n", "Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. \n", "\n", "This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges. 
One time-related coding task that we won’t cover here is how to run scripts or functions at a given frequency, ie how to schedule jobs.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "51a55374", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# remove cell\n", "import matplotlib_inline.backend_inline\n", "import matplotlib.pyplot as plt\n", "\n", "# Plot settings\n", "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n", "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")" ] }, { "cell_type": "markdown", "id": "17575f3a", "metadata": {}, "source": [ "### Prerequisites\n", "\n", "You will need to install the **seaborn** package for this chapter. This chapter uses the next generation version of **seaborn**, which can be installed by running the following on the command line (aka in the terminal): \n", "\n", "```bash\n", "pip install --pre seaborn\n", "```\n", "\n", "We will also be using the **pandas** package and the numerical package **numpy**." ] }, { "cell_type": "markdown", "id": "ff99055a", "metadata": {}, "source": [ "## Time in Python\n", "\n", "A point in time as represented in data science is composed of a clock time and a date. These two elements are brought together as a *datetime*.\n", "\n", "The datetime object is the fundamental time object in Python. It’s useful to know about datetime objects before moving on to datetime operations using **pandas** (which you’re far more likely to use in practice). Python's *datetime* objects capture the year, month, day, hour, minute, second, and microsecond. Let’s import the class that deals with datetimes (whose objects are of type `datetime.datetime`) and take a look at it."
] }, { "cell_type": "code", "execution_count": null, "id": "84829a6b", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "now = datetime.now()\n", "print(now)" ] }, { "cell_type": "markdown", "id": "70893642", "metadata": {}, "source": [ "Most people will be more used to working with day-month-year, while some people even have month-day-year, which clearly makes no sense at all! But note datetime follows [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601), the international standard for datetimes that has year-month-day-hrs:mins:seconds, with hours in the 24-hour clock format. This is the format you should use when coding too." ] }, { "cell_type": "markdown", "id": "5d224fed", "metadata": {}, "source": [ "We can see that the variable we created has attributes such as `year`, `month`, `day`, and so on, down to `microsecond`. Accessing these attributes on the `now` object we created returns the relevant detail. \n", "\n", "```{admonition} Exercise\n", "Try accessing the `year`, `month`, and `day` attributes of an instance of `datetime.now()`.\n", "```\n", "\n", "Note that, once created, `now` does not refresh itself: it's frozen at the time that it was made." ] }, { "cell_type": "markdown", "id": "bc5604b3", "metadata": {}, "source": [ "## Creating Datetimes\n", "\n", "### From Individual Components\n", "\n", "To create a datetime from given numerical information, the command is:" ] }, { "cell_type": "code", "execution_count": null, "id": "fc224a47", "metadata": {}, "outputs": [], "source": [ "specific_datetime = datetime(2019, 11, 28)\n", "print(specific_datetime)" ] }, { "cell_type": "markdown", "id": "5adef873", "metadata": {}, "source": [ "To make clearer and more readable code, you can also call this using keyword arguments: `datetime(year=2019, month=11, day=28)`."
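] }, { "cell_type": "markdown", "id": "e4b1a0c1", "metadata": {}, "source": [ "As a minimal sketch of the keyword-argument form (the hour and minute values and the name `detailed_datetime` are illustrative assumptions, not part of the example above), you can also pass clock-time components and then read individual parts back:" ] }, { "cell_type": "code", "execution_count": null, "id": "e4b1a0c2", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "# Illustrative values: any valid date and time components would do\n", "detailed_datetime = datetime(year=2019, month=11, day=28, hour=14, minute=30)\n", "print(detailed_datetime)\n", "# Individual components are read back without brackets\n", "print(detailed_datetime.year, detailed_datetime.month, detailed_datetime.minute)"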
] }, { "cell_type": "markdown", "id": "70382b04", "metadata": {}, "source": [ "### From a String\n", "\n", "One of the most common transformations you're likely to need to do when it comes to times is the one from a string, like \"4 July 2002\", to a datetime. You can do this using `datetime.strptime()`. Here's an example:" ] }, { "cell_type": "code", "execution_count": null, "id": "4558d476", "metadata": {}, "outputs": [], "source": [ "date_string = \"16 February in 2002\"\n", "datetime.strptime(date_string, \"%d %B in %Y\")" ] }, { "cell_type": "markdown", "id": "6810206d", "metadata": {}, "source": [ "What's going on? The pattern of the date string is \"day month 'in' year\". Python's `strptime()` function has codes for the different parts of a datetime (and the different ways they can be expressed). For example, if you had the short version of the month instead of the long one, it would be:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb90ac84", "metadata": {}, "outputs": [], "source": [ "date_string = \"16 Feb in 2002\"\n", "datetime.strptime(date_string, \"%d %b in %Y\")" ] }, { "cell_type": "markdown", "id": "6c540ef4", "metadata": {}, "source": [ "Of course, you don't always want to have to worry about the ins and outs of what you're passing in, and the third-party `dateutil` package (installed as a dependency of **pandas**) is here for flexible parsing of formats should you need that (explicit is better than implicit though!):" ] }, { "cell_type": "code", "execution_count": null, "id": "7a568f5b", "metadata": {}, "outputs": [], "source": [ "from dateutil.parser import parse\n", "\n", "date_string = \"03 Feb 02\"\n", "print(parse(date_string))\n", "date_string = \"3rd February 2002\"\n", "print(parse(date_string))" ] }, { "cell_type": "markdown", "id": "48cb9b0c", "metadata": {}, "source": [ "What about turning a datetime into a string? We can do that too, using the `strftime()` method and the same codes."
] }, { "cell_type": "code", "execution_count": null, "id": "d6b5e3c3", "metadata": {}, "outputs": [], "source": [ "now.strftime(\"%A, %m, %Y\")" ] }, { "cell_type": "markdown", "id": "be9ab6f0", "metadata": {}, "source": [ "You can find a close-to-comprehensive list of `strftime` codes at [https://strftime.org/](https://strftime.org/), but they're reproduced in the table below for convenience. \n", "\n", "| Code | Meaning | Example |\n", "|-|-|-|\n", "| %a | Weekday as locale’s abbreviated name. | Mon |\n", "| %A | Weekday as locale’s full name. | Monday |\n", "| %w | Weekday as a decimal number, where 0 is Sunday and 6 is Saturday. | 1 |\n", "| %d | Day of the month as a zero-padded decimal number. | 30 |\n", "| %-d | Day of the month as a decimal number. (Platform specific) | 30 |\n", "| %b | Month as locale’s abbreviated name. | Sep |\n", "| %B | Month as locale’s full name. | September |\n", "| %m | Month as a zero-padded decimal number. | 09 |\n", "| %-m | Month as a decimal number. (Platform specific) | 9 |\n", "| %y | Year without century as a zero-padded decimal number. | 13 |\n", "| %Y | Year with century as a decimal number. | 2013 |\n", "| %H | Hour (24-hour clock) as a zero-padded decimal number. | 07 |\n", "| %-H | Hour (24-hour clock) as a decimal number. (Platform specific) | 7 |\n", "| %I | Hour (12-hour clock) as a zero-padded decimal number. | 07 |\n", "| %-I | Hour (12-hour clock) as a decimal number. (Platform specific) | 7 |\n", "| %p | Locale’s equivalent of either AM or PM. | AM |\n", "| %M | Minute as a zero-padded decimal number. | 06 |\n", "| %-M | Minute as a decimal number. (Platform specific) | 6 |\n", "| %S | Second as a zero-padded decimal number. | 05 |\n", "| %-S | Second as a decimal number. (Platform specific) | 5 |\n", "| %f | Microsecond as a decimal number, zero-padded on the left. | 000000 |\n", "| %z | UTC offset in the form +HHMM or -HHMM (empty string if the object is naive). 
| |\n", "| %Z | Time zone name (empty string if the object is naive). | |\n", "| %j | Day of the year as a zero-padded decimal number. | 273 |\n", "| %-j | Day of the year as a decimal number. (Platform specific) | 273 |\n", "| %U | Week number of the year (Sunday as the first day of the week) as a zero-padded decimal number. | 39 |\n", "| %W | Week number of the year (Monday as the first day of the week) as a decimal number. | 39 |\n", "| %c | Locale’s appropriate date and time representation. | Mon Sep 30 07:06:05 2013 |\n", "| %x | Locale’s appropriate date representation. | 09/30/13 |\n", "| %X | Locale’s appropriate time representation. | 07:06:05 |\n", "| %% | A literal '%' character. | % |" ] }, { "cell_type": "markdown", "id": "fc02b84d", "metadata": {}, "source": [ "## Operations on Datetimes\n", "\n", "Many of the operations you'd expect to just work with datetimes do. For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "39ba17e6", "metadata": {}, "outputs": [], "source": [ "now > specific_datetime" ] }, { "cell_type": "markdown", "id": "59077f81", "metadata": {}, "source": [ "As well as recording or comparing a *single* datetime, there are plenty of occasions when we'll be interested in *differences* in datetimes. Let's create one and then check its type." ] }, { "cell_type": "code", "execution_count": null, "id": "fda57a44", "metadata": {}, "outputs": [], "source": [ "time_diff = now - datetime(year=2020, month=1, day=1)\n", "print(time_diff)" ] }, { "cell_type": "markdown", "id": "9b7749fc", "metadata": {}, "source": [ "This is in the format of days, hours, minutes, seconds, and microseconds. Let's check the type with `type()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "e1e25736", "metadata": {}, "outputs": [], "source": [ "type(time_diff)" ] }, { "cell_type": "markdown", "id": "97c8d655", "metadata": {}, "source": [ "This is of type `datetime.timedelta`, Python's built-in class for a duration, ie the difference between two datetimes."
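] }, { "cell_type": "markdown", "id": "b7d2f3a1", "metadata": {}, "source": [ "To make differences in datetimes a little more concrete, here is a minimal sketch using made-up dates (the names `start` and `gap` are illustrative assumptions): a `timedelta` can be created directly and added to a datetime, and the difference between two datetimes can be read off in days or in seconds." ] }, { "cell_type": "code", "execution_count": null, "id": "b7d2f3a2", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime, timedelta\n", "\n", "start = datetime(year=2020, month=1, day=1)\n", "# Adding a timedelta to a datetime gives a new datetime\n", "print(start + timedelta(weeks=1))\n", "# Subtracting one datetime from another gives a timedelta\n", "gap = datetime(year=2020, month=3, day=1) - start\n", "# 2020 is a leap year, so January plus February is 31 + 29 = 60 days\n", "print(gap.days)\n", "print(gap.total_seconds())"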
] }, { "cell_type": "markdown", "id": "c55fb124", "metadata": {}, "source": [ "## Timezones\n", "\n" ] }, { "cell_type": "markdown", "id": "acaaf88c", "metadata": {}, "source": [ "Date and time objects may be categorized as aware or naive depending on whether or not they include timezone information; an aware object can locate itself relative to other aware objects, but a naive object does not contain enough information to unambiguously locate itself relative to other date/time objects. So far we've been working with naive datetime objects.\n", "\n", "The [**pytz**](https://pypi.org/project/pytz/) package can help you work with timezones. It has two main use cases: i) localising timezone-naive datetimes so that they become aware, ie have a timezone, and ii) converting a datetime in one timezone to another timezone.\n", "\n", "The default timezone for coding is UTC. ‘UTC’ is Coordinated Universal Time. It is a successor to, but distinct from, Greenwich Mean Time (GMT) and the various definitions of Universal Time. UTC is now the worldwide standard for regulating clocks and time measurement.\n", "\n", "All other timezones are defined relative to UTC, and include offsets like UTC+0800, ie the hours to add to or subtract from UTC to derive the local time. No daylight saving time occurs in UTC, making it a useful timezone in which to perform date arithmetic without worrying about the confusion and ambiguities caused by daylight saving time transitions, your country changing its timezone, or mobile computers that roam through multiple timezones." ] }, { "cell_type": "markdown", "id": "58bedcdc", "metadata": {}, "source": [ "## Vectorised Datetimes \n", "\n", "Now we come to vectorised operations on datetimes using the powerful **numpy** package (which is what **pandas** uses under the hood). **numpy** has its own version of datetime, called `np.datetime64`, and it's very efficient at scale. 
Let's see it in action:" ] }, { "cell_type": "code", "execution_count": null, "id": "ed526fbc", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "date = np.array(\"2020-01-01\", dtype=np.datetime64)\n", "date" ] }, { "cell_type": "markdown", "id": "558f758b", "metadata": {}, "source": [ "The 'D' tells us that the smallest unit here is days. We can easily create a vector of dates from this object:" ] }, { "cell_type": "code", "execution_count": null, "id": "537895c5", "metadata": {}, "outputs": [], "source": [ "date + range(32)" ] }, { "cell_type": "markdown", "id": "ca81358f", "metadata": {}, "source": [ "Note how the last day rolls over into the next month.\n", "\n", "If you are creating a datetime with more precision than a day, **numpy** will figure it out from the input; for example, this gives resolution down to minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "cd7a15e3", "metadata": {}, "outputs": [], "source": [ "np.datetime64(\"2020-01-01 09:00\")" ] }, { "cell_type": "markdown", "id": "c0a974aa", "metadata": {}, "source": [ "One word of warning with **numpy** and datetimes though: the more precise you go (and you can go all the way down to attoseconds, $10^{-18}$ seconds), the smaller the range of dates you can represent. A popular choice of precision is `datetime64[ns]`, which can encode times from 1678 AD to 2262 AD. Working with seconds gets you 2.9$\\times 10^{11}$ BC to 2.9$\\times 10^{11}$ AD." ] }, { "cell_type": "markdown", "id": "417ec546", "metadata": {}, "source": [ "## Working with Datetimes in Data Frames" ] }, { "cell_type": "markdown", "id": "82a4a3f3", "metadata": {}, "source": [ "[**pandas**](https://pandas.pydata.org/) is the workhorse of time series analysis in Python. The basic object is a *timestamp*. The `pd.to_datetime()` function creates timestamps from strings that could reasonably represent datetimes. 
Let's see an example of using `pd.to_datetime()` to create a timestamp and then take a look at it." ] }, { "cell_type": "code", "execution_count": null, "id": "48f0e9c6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "date = pd.to_datetime(\"16th of February, 2020\")\n", "date" ] }, { "cell_type": "markdown", "id": "8b083619", "metadata": {}, "source": [ "This is of type `Timestamp` and you can see that it has many of the same properties as the built-in Python `datetime.datetime` class that we met earlier in this chapter. As with that class, the default setting for `tz` (timezone) and `tzinfo` is `None`. There are some extra properties, though, such as `freq` for frequency, which will be very useful when it comes to manipulating time *series* as opposed to just one or two datetimes." ] }, { "cell_type": "markdown", "id": "dfdf18bc", "metadata": {}, "source": [ "### Creating and Using Time Series\n", "\n", "There are two main scenarios in which you might be creating time series using **pandas**: i) creating one from scratch or ii) reading in data from a file. Let's look at a few ways to do i) first. 
\n", "\n", "You can create a time series with **pandas** by taking a date as created above and extending it using the **pandas** `to_timedelta()` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "462b26da", "metadata": {}, "outputs": [], "source": [ "date + pd.to_timedelta(np.arange(12), \"D\")" ] }, { "cell_type": "markdown", "id": "3ba587ff", "metadata": {}, "source": [ "This has created a datetime index of type `datetime64[ns]` (remember, an index is a special type of **pandas** column), where \"ns\" stands for nanosecond resolution.\n", "\n", "Another method is to create a range of dates between a start and an end (you can also pass a frequency using the `freq=` keyword argument; the default is daily):" ] }, { "cell_type": "code", "execution_count": null, "id": "10e71325", "metadata": {}, "outputs": [], "source": [ "pd.date_range(start=\"2018/1/1\", end=\"2018/1/8\")" ] }, { "cell_type": "markdown", "id": "3f6c8a4f", "metadata": {}, "source": [ "Another way to create ranges is to specify the number of periods and the frequency:" ] }, { "cell_type": "code", "execution_count": null, "id": "291ace2c", "metadata": {}, "outputs": [], "source": [ "pd.date_range(\"2018-01-01\", periods=3, freq=\"H\")" ] }, { "cell_type": "markdown", "id": "ce8c06d8", "metadata": {}, "source": [ "Following the earlier discussion of timezones, you can also localise timezones directly in **pandas** data frames:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6703682c", "metadata": {}, "outputs": [], "source": [ "dti = pd.date_range(\"2018-01-01\", periods=3, freq=\"H\").tz_localize(\"UTC\")\n", "dti.tz_convert(\"US/Pacific\")" ] }, { "cell_type": "markdown", "id": "4d3e7668", "metadata": {}, "source": [ "Now let's see how to turn data that has been read in with a non-datetime type into a vector of datetimes. This happens *all the time* in practice. 
We'll read in some data on job vacancies for information and communication jobs, ONS code UNEM-JP9P, and then try to wrangle the given \"date\" column into a **pandas** datetime column." ] }, { "cell_type": "code", "execution_count": null, "id": "dd00df7f", "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = \"https://api.ons.gov.uk/timeseries/JP9P/dataset/UNEM/data\"\n", "\n", "# Get the data from the ONS API:\n", "df = pd.DataFrame(pd.json_normalize(requests.get(url).json()[\"months\"]))\n", "df[\"value\"] = pd.to_numeric(df[\"value\"])\n", "df = df[[\"date\", \"value\"]]\n", "df = df.rename(columns={\"value\": \"Vacancies (ICT), thousands\"})\n", "df.head()" ] }, { "cell_type": "markdown", "id": "1be10bf8", "metadata": {}, "source": [ "We have the data in. Let's look at the column types that arrived." ] }, { "cell_type": "code", "execution_count": null, "id": "247d9725", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "71bbbb02", "metadata": {}, "source": [ "The date column has the default 'object' type, but we want it to have type `datetime64[ns]`, which is a datetime type. Again, we use `pd.to_datetime()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "b90f8038", "metadata": {}, "outputs": [], "source": [ "df[\"date\"] = pd.to_datetime(df[\"date\"])\n", "df[\"date\"].head()" ] }, { "cell_type": "markdown", "id": "ca0b1727", "metadata": {}, "source": [ "In this case, the conversion from the input format, \"2001 MAY\", to datetime worked out of the box. `pd.to_datetime` will always take an educated guess as to the format, but it won't always work out.\n", "\n", "What happens if we have a more tricky-to-read-in datetime column? This frequently occurs in practice so it's well worth exploring an example. Let's create some example data with dates in an unusual format with month first, then year, then day, eg \"1, '19, 22\" and so on."
] }, { "cell_type": "code", "execution_count": null, "id": "05d056ae", "metadata": {}, "outputs": [], "source": [ "small_df = pd.DataFrame({\"date\": [\"1, '19, 22\", \"1, '19, 23\"], \"values\": [\"1\", \"2\"]})\n", "small_df[\"date\"]" ] }, { "cell_type": "markdown", "id": "b1e6f80c", "metadata": {}, "source": [ "Now, if we were to run this via `pd.to_datetime` with no further input, it would misinterpret, for example, the first date as `2022-01-19`. So we must pass a bit more info to `pd.to_datetime` to help it out. We can pass a `format=` keyword argument with the format that the datetime takes. Here, we'll use `%m` for month in number format, `%y` for year in 2-digit format, and `%d` for 2-digit day. We can also add in the other characters such as `'` and `,`. You can find a list of datetime format identifiers above or over at [https://strftime.org/](https://strftime.org/)." ] }, { "cell_type": "code", "execution_count": null, "id": "514c9052", "metadata": {}, "outputs": [], "source": [ "pd.to_datetime(small_df[\"date\"], format=\"%m, '%y, %d\")" ] }, { "cell_type": "markdown", "id": "56481fd3", "metadata": {}, "source": [ "### Datetime Offsets\n", "\n", "Our data, currently held in `df`, were read in as if they were from the *start* of the month but these data refer to the month that has passed and so should be for the *end* of the month. Fortunately, we can change this using a time offset." ] }, { "cell_type": "code", "execution_count": null, "id": "ac3addbc", "metadata": {}, "outputs": [], "source": [ "df[\"date\"] = df[\"date\"] + pd.offsets.MonthEnd()\n", "df.head()" ] }, { "cell_type": "markdown", "id": "61ef45b4", "metadata": {}, "source": [ "While we used the `MonthEnd` offset here, there are many different offsets available. You can find a [full table of date offsets here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)." 
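] }, { "cell_type": "markdown", "id": "c9e5d4f1", "metadata": {}, "source": [ "As a rough sketch of how a few other offsets behave, here they are applied to a made-up timestamp rather than to the vacancies data (the name `ts` is an illustrative assumption). Note the last line: a date that is already a month end is moved on to the *next* month end, which is worth keeping in mind before adding `MonthEnd` to data that are already end-of-month." ] }, { "cell_type": "code", "execution_count": null, "id": "c9e5d4f2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "ts = pd.Timestamp(\"2020-02-10\")  # a Monday\n", "# Roll forward to the last day of the month\n", "print(ts + pd.offsets.MonthEnd())\n", "# Roll forward to the first day of the next month\n", "print(ts + pd.offsets.MonthBegin())\n", "# Move forward by three business days\n", "print(ts + pd.offsets.BDay(3))\n", "# Already a month end, so this moves to the end of the following month\n", "print(pd.Timestamp(\"2020-02-29\") + pd.offsets.MonthEnd())"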
] }, { "cell_type": "markdown", "id": "5afd54be", "metadata": {}, "source": [ "### The `.dt` accessor\n", "\n", "When you have a datetime column, you can use the `.dt` accessor to grab lots of useful information from it such as the `minute`, `month`, and so on. Some of these are methods rather than properties, and so must be called with brackets, `()`. Here are a few useful examples:" ] }, { "cell_type": "code", "execution_count": null, "id": "a6c3d2d9", "metadata": {}, "outputs": [], "source": [ "print(\"Using `dt.day_name()`\")\n", "print(df[\"date\"].dt.day_name().head())\n", "print(\"Using `dt.isocalendar()`\")\n", "print(df[\"date\"].dt.isocalendar().head())\n", "print(\"Using `dt.month`\")\n", "print(df[\"date\"].dt.month.head())" ] }, { "cell_type": "markdown", "id": "a40f982f", "metadata": {}, "source": [ "### Creating a datetime Index and Setting the Frequency\n", "\n", "For the subsequent parts, we'll set the datetime column to be the index of the data frame. *This is the standard setup you will likely want to use when dealing with time series.*" ] }, { "cell_type": "code", "execution_count": null, "id": "e0a4f68d", "metadata": {}, "outputs": [], "source": [ "df = df.set_index(\"date\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "8d5017ed", "metadata": {}, "source": [ "Now, if we look at the first few entries of the index of the data frame (a datetime index), we'll see that the `freq=` parameter is set to `None`."
] }, { "cell_type": "code", "execution_count": null, "id": "acf1ae60", "metadata": {}, "outputs": [], "source": [ "df.index[:5]" ] }, { "cell_type": "markdown", "id": "3bd0fe54", "metadata": {}, "source": [ "This can be set for the whole data frame using the `asfreq()` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "9146c99d", "metadata": {}, "outputs": [], "source": [ "df = df.asfreq(\"M\")\n", "df.index[:5]" ] }, { "cell_type": "markdown", "id": "8f024684", "metadata": {}, "source": [ "Although most of the time it doesn't matter that `freq=None`, some aggregation operations need to know the frequency of the time series in order to work and it's good practice to set it if your data *are* regular. You can use `asfreq` to go from a higher frequency to a lower frequency too: the last entry from the higher frequency that aligns with the lower frequency will be taken, for example in going from months to years, December's value would be used.\n", "\n", "Note that trying to set the frequency when your datetime index doesn't match up to a particular frequency can cause errors or other problems, such as unexpected missing values.\n", "\n", "A few useful frequencies to know about are in the table below; all of these can be used with `pd.date_range()` too.\n", "\n", "| Code | Represents |\n", "|-------|---------------------------------------------------------------------|\n", "| D | Calendar day |\n", "| W | Weekly |\n", "| M | Month end |\n", "| Q | Quarter end |\n", "| A | Year end |\n", "| H | Hours |\n", "| T | Minutes |\n", "| S | Seconds |\n", "| B | Business day |\n", "| BM | Business month end |\n", "| BQ | Business quarter end |\n", "| BA | Business year end |\n", "| BH | Business hours |\n", "| MS | Month start |\n", "| QS | Quarter start |\n", "| W-SUN | Weekly, anchored on Sundays (similar for other days) |\n", "| 2M | Every 2 months (works with other combinations of numbers and codes) |" ] }, { "cell_type": "markdown", "id": "9081a485", "metadata": {}, "source": 
[ "## Making Quick Time Series Plots\n", "\n", "Having managed to put your time series into a data frame, perhaps converting a column of type string into a column of type datetime in the process, you often just want to see the thing! We can achieve this using the `plot()` command, as long as we have a datetime index.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b4c5f841", "metadata": {}, "outputs": [], "source": [ "df.plot();" ] }, { "cell_type": "markdown", "id": "e8ae2aa0", "metadata": {}, "source": [ "## Resampling, Rolling, and Shifting\n", "\n", "Now that our data have a *datetime index*, some common time series operations are made very easy for us.\n", "\n", "### Resampling\n", "\n", "Quite frequently, you will want to change the frequency of a given time series. A time index-based data frame makes this easy via the `resample()` function. `resample()` must be told *how* you'd like to resample the data, for example via the mean or median. Here's an example resampling the monthly data to annual and taking the mean:" ] }, { "cell_type": "code", "execution_count": null, "id": "e56ba5c4", "metadata": {}, "outputs": [], "source": [ "df.resample(\"A\").mean()" ] }, { "cell_type": "markdown", "id": "d320a934", "metadata": {}, "source": [ "As resample is just a special type of aggregation, it can work with all of the usual functions that aggregations do, including in-built functions or user-defined functions." ] }, { "cell_type": "code", "execution_count": null, "id": "fbbbcdff", "metadata": {}, "outputs": [], "source": [ "df.resample(\"5A\").agg([\"mean\", \"std\"]).head()" ] }, { "cell_type": "markdown", "id": "c3b45c36", "metadata": {}, "source": [ "Resampling can go up in frequency (up-sampling) as well as down, but then we no longer need to choose an aggregation function; instead, we must choose how we'd like to fill in the gaps for the frequencies we didn't have in the original data. 
In the example below, they are just left as NaNs." ] }, { "cell_type": "code", "execution_count": null, "id": "9a48a45f", "metadata": {}, "outputs": [], "source": [ "df.resample(\"D\").asfreq()" ] }, { "cell_type": "markdown", "id": "c08e1dad", "metadata": {}, "source": [ "Options to fill in missing time series data include using `bfill` or `ffill` to fill in the blanks based on the next or last available value, respectively, or `interpolate()` (note how only the first 3 NaNs are replaced using the `limit` keyword argument):" ] }, { "cell_type": "code", "execution_count": null, "id": "d3ac1789", "metadata": {}, "outputs": [], "source": [ "df.resample(\"D\").interpolate(method=\"linear\", limit_direction=\"forward\", limit=3)[:6]" ] }, { "cell_type": "markdown", "id": "3da716ee", "metadata": {}, "source": [ "We can see the differences between the filling methods more clearly in these stock market data, following a chart by Jake Vanderplas." ] }, { "cell_type": "code", "execution_count": null, "id": "51647c56", "metadata": {}, "outputs": [], "source": [ "# Get stock market data\n", "import pandas_datareader as web\n", "\n", "xf = web.DataReader(\"AAPL\", \"stooq\", start=\"2017-01-01\", end=\"2019-06-01\")\n", "xf = xf.sort_index()" ] }, { "cell_type": "code", "execution_count": null, "id": "fa0c9973", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "data = xf.iloc[:10, 3]\n", "# \"C1\" and \"C2\" are matplotlib aliases for the second and third colours of the\n", "# current colour cycle; they replace the private `prop_cycler` attribute, which\n", "# is not available in recent matplotlib releases\n", "\n", "data.asfreq(\"D\").plot(ax=ax, marker=\"o\", linestyle=\"None\", zorder=3)\n", "data.asfreq(\"D\", method=\"bfill\").plot(ax=ax, style=\"-.o\", lw=1, color=\"C1\")\n", "data.asfreq(\"D\", method=\"ffill\").plot(ax=ax, style=\"--o\", lw=1, color=\"C2\")\n", "ax.set_ylabel(\"Close ($)\")\n", "ax.legend([\"original\", \"back-fill\", \"forward-fill\"]);" ] }, { "cell_type": "markdown", "id": "235f6b1c", "metadata": {}, "source": [ "### Rolling Window Functions\n", "\n", 
"The `rolling()` and `ewm()` methods are both rolling window functions. The first includes functions of the sequence\n", "\n", "$$\n", "y_t = f(\\{x_{t-i} \\}_{i=0}^{i=R-1})\n", "$$\n", "\n", "where $R$ is the number of periods to use for the rolling window. For example, if the function is the mean, then $f$ takes the form $\\frac{1}{R}\\displaystyle\\sum_{i=0}^{i=R-1} x_{t-i}$.\n", "\n", "The example below is a 2-period rolling mean:" ] }, { "cell_type": "code", "execution_count": null, "id": "1ddc4fb2", "metadata": {}, "outputs": [], "source": [ "df.rolling(2).mean()" ] }, { "cell_type": "markdown", "id": "488f3c7c", "metadata": {}, "source": [ "The `ewm()` includes the class of functions where data point $x_{t-i}$ has a weight $w_i = (1-\\alpha)^i$. As $0 < \\alpha < 1$, points further back in time are given less weight. For example, an exponentially moving average is given by\n", "\n", "$$\n", "y_t = \\frac{x_t + (1 - \\alpha)x_{t-1} + (1 - \\alpha)^2 x_{t-2} + ...\n", "+ (1 - \\alpha)^t x_{0}}{1 + (1 - \\alpha) + (1 - \\alpha)^2 + ...\n", "+ (1 - \\alpha)^t}\n", "$$\n", "\n", "The example below shows the code for the exponentially weighted moving average:" ] }, { "cell_type": "code", "execution_count": null, "id": "0ea9c8ce", "metadata": {}, "outputs": [], "source": [ "df.ewm(alpha=0.2).mean()" ] }, { "cell_type": "markdown", "id": "8d9eb44c", "metadata": {}, "source": [ "Let's see these methods together on the stock market data." 
] }, { "cell_type": "code", "execution_count": null, "id": "0af7b5e6", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "roll_num = 28\n", "alpha = 0.03\n", "xf[\"Close\"].plot(label=\"Raw\", alpha=0.5)\n", "xf[\"Close\"].expanding().mean().plot(label=\"Expanding Average\", style=\":\")\n", "xf[\"Close\"].ewm(alpha=alpha).mean().plot(\n", " label=f\"EWMA ($\\\\alpha=${alpha:.2f})\", style=\"--\"\n", ")\n", "xf[\"Close\"].rolling(roll_num).mean().plot(label=f\"{roll_num}D MA\", style=\"-.\")\n", "ax.legend()\n", "ax.set_ylabel(\"Close ($)\");" ] }, { "cell_type": "markdown", "id": "6f795baa", "metadata": {}, "source": [ "For more tools to analyse stocks, see the [**Pandas TA**](https://twopirllc.github.io/pandas-ta/) package.\n", "\n", "We can also use `rolling()` as an intermediate step in creating more than one type of aggregation:" ] }, { "cell_type": "code", "execution_count": null, "id": "134199ae", "metadata": {}, "outputs": [], "source": [ "roll = xf[\"Close\"].rolling(50, center=True)\n", "\n", "fig, ax = plt.subplots()\n", "m = roll.agg([\"mean\", \"std\"])\n", "m[\"mean\"].plot(ax=ax)\n", "ax.fill_between(m.index, m[\"mean\"] - m[\"std\"], m[\"mean\"] + m[\"std\"], alpha=0.2)\n", "ax.set_ylabel(\"Close ($)\");" ] }, { "cell_type": "markdown", "id": "f3884a6a", "metadata": {}, "source": [ "### Shifting\n", "\n", "Shifting can move series around in time; it's what we need to create leads and lags of time series. Let's create a lead and a lag in the data. Remember that a lead is going to shift the pattern in the data to the left (ie earlier in time), while the lag is going to shift patterns later in time (ie to the right)." 
] }, { "cell_type": "code", "execution_count": null, "id": "3078fbb4", "metadata": {}, "outputs": [], "source": [ "lead = 12\n", "lag = 3\n", "orig_series_name = df.columns[0]\n", "df[f\"lead ({lead} months)\"] = df[orig_series_name].shift(-lead)\n", "df[f\"lag ({lag} months)\"] = df[orig_series_name].shift(lag)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "18b9afb3", "metadata": {}, "outputs": [], "source": [ "df.iloc[100:300, :].plot();" ] } ], "metadata": { "interpreter": { "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc" }, "jupytext": { "cell_metadata_filter": "-all", "encoding": "# -*- coding: utf-8 -*-", "formats": "md:myst", "main_language": "python" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "toc-showtags": true }, "nbformat": 4, "nbformat_minor": 5 }