{ "cells": [ { "cell_type": "markdown", "id": "d643ff24", "metadata": {}, "source": [ "# Building Your Intuition About Trendlists" ] }, { "cell_type": "markdown", "id": "27c37c29", "metadata": {}, "source": [ "You've used the `trends` package to reveal a few, simple properties of trends in random sequences:\n", "\n", "* How fast you can decompose the sequences into trends\n", "* How many trends to expect\n", "* How many rotations you need to turn the sequence into a single trend.\n", "\n", "Let's explore `Trends` and `Trendlists` a little more." ] }, { "cell_type": "markdown", "id": "8e42f17b", "metadata": {}, "source": [ "## How Long Is the Longest Trend?" ] }, { "cell_type": "markdown", "id": "1b31f929", "metadata": {}, "source": [ "You should be able to find the longest trend in a random sequence easily enough: \n", "\n", "1. Decompose the sequence into trends.\n", "1. Make a list of the lengths of each one.\n", "1. Find the biggest element in that list.\n", "\n", "You can do the first with `trends.decompose()`, and the last with Python's `max()`.\n", "How do you find the length of a `Trend` object?\n", "\n", "Easy: with its `length` attribute, `Trend.length`.\n", "\n", "Here's that code." ] }, { "cell_type": "code", "execution_count": null, "id": "02b4bb4b", "metadata": {}, "outputs": [], "source": [ "import math\n", "from random import random\n", "\n", "import trends" ] }, { "cell_type": "code", "execution_count": null, "id": "8578263b", "metadata": {}, "outputs": [], "source": [ "def mean_longest_trend(nrands, trials):\n", " total_max_lengths = 0\n", " for _ in range(trials):\n", " seq = [random() for _ in range(nrands)]\n", " trnds = trends.decompose(seq)[0]\n", " lengths = [trnd.length for trnd in trnds]\n", " total_max_lengths += max(lengths)\n", " return total_max_lengths // trials" ] }, { "cell_type": "markdown", "id": "e1bc6aa2", "metadata": {}, "source": [ "Try it out a few times." ] }, { "cell_type": "code", "execution_count": null, "id": "dc1d5194", "metadata": {}, "outputs": [], "source": [ "mean_longest_trend(100, 100)" ] }, { "cell_type": "markdown", "id": "09c682db", "metadata": {}, "source": [ "Look to see how it varies with the length of the sequence." ] }, { "cell_type": "code", "execution_count": null, "id": "be1502c2", "metadata": {}, "outputs": [], "source": [ "def longest_by_sequence_length(max, trials, points):\n", " interval = max // points\n", " longest = {}\n", " for nrands in range(1, max, interval):\n", " longest[nrands] = mean_longest_trend(nrands, trials)\n", " return longest" ] }, { "cell_type": "code", "execution_count": null, "id": "18a96c62", "metadata": { "scrolled": true }, "outputs": [], "source": [ "longests = longest_by_sequence_length(100_000, 100, 10)\n", "longests" ] }, { "cell_type": "markdown", "id": "2f59b917", "metadata": {}, "source": [ "Seems like a nice graph would help. First, just steal code from the earlier tutorial." ] }, { "cell_type": "code", "execution_count": null, "id": "870308f8", "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "import numpy as np\n", "\n", "def slope_and_intercept(x, y):\n", " # fit a least-squares line\n", " m, b = np.polyfit(x, y, deg=1)\n", " # coefficient of determination = correlation ** 2\n", " return (m, b)\n", "\n", "def legend(m, b, r):\n", " equation = f\"y = {m:2.2f}*x + {b:2.2f}\"\n", " rho_sq = \"\\u03c1**2\" # Unicode character for rho is U+03C1. Sadly, the font lacks superscripts.\n", " goodness_of_fit = f\"{rho_sq} = {r**2:2.3f}\"\n", " legend = f\"{equation}\\n{goodness_of_fit}\"\n", " return legend\n", "\n", "def annotate_graph(x, y):\n", " # add in a best-fit line and a legend\n", " m, b = slope_and_intercept(x, y) \n", " r = np.corrcoef(x, y)[0,1]\n", "\n", " # create nparray that spans the x-space\n", " xseq = np.linspace(0, math.ceil(x[-1]), num=100)\n", "\n", " # create a legend\n", " graph_legend = legend(m, b, r)\n", "\n", " # add best-fit line and legend to the plot\n", " plt.plot(xseq, m * xseq + b, color=\"red\", lw=1.5) # best-fit line\n", " plt.text(0, y[-2], graph_legend, color=\"red\") # legend in the urh corner\n", " \n", "\n", "def graph(x, y, x_label, y_label, title): \n", " # Create a scatterplot with a title and axis labels\n", " plt.figure(figsize=(10, 10))\n", " plt.scatter(x, y, s=10, alpha=0.7)\n", " plt.title(title)\n", " plt.xlabel(x_label)\n", " plt.ylabel(y_label)\n", " annotate_graph(x, y) # add the legend and the best-fit line\n", " " ] }, { "cell_type": "markdown", "id": "f8329b52", "metadata": {}, "source": [ "Having defined those, you can graph your data." ] }, { "cell_type": "code", "execution_count": null, "id": "f5acde91", "metadata": {}, "outputs": [], "source": [ "seq_lengths = list(longests.keys())\n", "longest_trends = list(longests.values())\n", "\n", "graph(seq_lengths, longest_trends, \n", " x_label=\"Sequence Length\",\n", " y_label=\"Mean Longest Trend\",\n", " title=\"Longest Trend as a Function of Sequence Length\")" ] }, { "cell_type": "markdown", "id": "bc2e7e8b", "metadata": {}, "source": [ "Sweet, right?\n", "\n", "And what's this tell you?\n", "\n", "The longest trend typically stretches out to nearly two-thirds of the sequence. \n", "That seems surprising, but it's useful to build your intuition." ] }, { "cell_type": "markdown", "id": "ef6df696", "metadata": {}, "source": [ "## What's the Mean of the Longest Trend?" ] }, { "cell_type": "markdown", "id": "1628d138", "metadata": {}, "source": [ "The whole sequence is just random floats between zero and one, so for a sequence of any length, the sequence mean will be $0.5$.\n", "\n", "Intuitively, since you know the longest trend is *also* really long -- over half the whole sequence, you'd guess that it, too, should have that average. Does it? \n", "\n", "It's easy to tweak the previous code slightly, to report both the length and the average of the longest trend?\n", "\n", "\n", "Objects in the class `trends.Trend` have two attributes: `length` and `average`. \n", "Here's how to use these both to collect and then to summarize data." ] }, { "cell_type": "code", "execution_count": null, "id": "053d0184", "metadata": {}, "outputs": [], "source": [ "def longest_trend(nrands):\n", " # return the longest trend in a random sequence\n", " seq = [random() for _ in range(nrands)]\n", " trnds = trends.decompose(seq)[0] # decompose into trends\n", " lengths = [trnd.length for trnd in trnds] # list all trend lengths\n", " longest = lengths.index(max(lengths)) # find which one's biggest\n", " return trnds[longest] # and return that trend, with both attributes." ] }, { "cell_type": "code", "execution_count": null, "id": "824a0024", "metadata": {}, "outputs": [], "source": [ "longest_trend(1000)" ] }, { "cell_type": "markdown", "id": "87eb7308", "metadata": {}, "source": [ "Now use that to describe the *typical* longest trend." ] }, { "cell_type": "code", "execution_count": null, "id": "4598c483", "metadata": {}, "outputs": [], "source": [ "def typical_longest_trend(nrands, trials):\n", " sum_averages = 0\n", " sum_lengths = 0\n", " for _ in range(trials):\n", " longest = longest_trend(nrands)\n", " sum_averages += longest.average\n", " sum_lengths += longest.length\n", " return trends.Trend(round(sum_averages/trials, 3), length=round(sum_lengths/trials))" ] }, { "cell_type": "code", "execution_count": null, "id": "41bed3e3", "metadata": {}, "outputs": [], "source": [ "typical_longest_trend(10_000, 10)" ] }, { "cell_type": "markdown", "id": "ba2370f7", "metadata": {}, "source": [ "## What Do the First and Last Trends Look Like?" ] }, { "cell_type": "markdown", "id": "f7b2aabe", "metadata": {}, "source": [ "You know trend means decrease monotonically, so the first trend has the biggest mean, and the last one, the smallest.\n", "How big and small are they?\n", "Use the same approach you used in the last section." ] }, { "cell_type": "code", "execution_count": null, "id": "c2ccbf23", "metadata": {}, "outputs": [], "source": [ "def trend_at(position, nrands):\n", " # return the trend at a particular position in a random sequence\n", " seq = [random() for _ in range(nrands)]\n", " trnds = trends.decompose(seq)[0] # decompose into trends\n", " return trnds[position] # and return that trend, with both attributes." ] }, { "cell_type": "markdown", "id": "9917ff8e", "metadata": {}, "source": [ "A `position` parameter means you won't have to write a separate function for first and last." ] }, { "cell_type": "code", "execution_count": null, "id": "a6558b1f", "metadata": {}, "outputs": [], "source": [ "FIRST = 0\n", "LAST = -1\n", "trend_at(FIRST, 10_000) # the first trend" ] }, { "cell_type": "code", "execution_count": null, "id": "f1c58422", "metadata": {}, "outputs": [], "source": [ "trend_at(LAST, 10_000) # the last trend" ] }, { "cell_type": "code", "execution_count": null, "id": "9e2a9e06", "metadata": {}, "outputs": [], "source": [ "def typical_trend_at(position, nrands, trials):\n", " sum_averages = 0\n", " sum_lengths = 0\n", " for _ in range(trials):\n", " trnd = trend_at(position, nrands)\n", " sum_averages += trnd.average\n", " sum_lengths += trnd.length\n", " return trends.Trend(round(sum_averages/trials, 3), length=round(sum_lengths/trials))" ] }, { "cell_type": "code", "execution_count": null, "id": "35a5181b", "metadata": {}, "outputs": [], "source": [ "typical_trend_at(FIRST, 10_000, 100)" ] }, { "cell_type": "code", "execution_count": null, "id": "a1961820", "metadata": {}, "outputs": [], "source": [ "typical_trend_at(LAST, 10_000, 100)" ] }, { "cell_type": "markdown", "id": "801adb3c", "metadata": {}, "source": [ "Now watch how it varies with sequence length:" ] }, { "cell_type": "code", "execution_count": null, "id": "5b8ab843", "metadata": {}, "outputs": [], "source": [ "def typical_trend_by_seq_length(position, max, trials, points):\n", " interval = max // points\n", " typicals = {}\n", " for nrands in range(1, max, interval):\n", " typicals[nrands] = typical_trend_at(position, nrands, trials)\n", " return typicals" ] }, { "cell_type": "markdown", "id": "7941f363", "metadata": {}, "source": [ "Here's the typical first trend:" ] }, { "cell_type": "code", "execution_count": null, "id": "0ac7fa02", "metadata": { "scrolled": true }, "outputs": [], "source": [ "typical_trend_by_seq_length(FIRST, 100_000, 100, 10)" ] }, { "cell_type": "markdown", "id": "e06e98b8", "metadata": {}, "source": [ "And the typical final trend:" ] }, { "cell_type": "code", "execution_count": null, "id": "280cb333", "metadata": {}, "outputs": [], "source": [ "typical_trend_by_seq_length(LAST, 100_000, 100, 10)" ] }, { "cell_type": "markdown", "id": "1b602b45", "metadata": {}, "source": [ "Lengths vary a lot, but the means are pretty stable: the first trend's mean is about 2/3, and the last is about 1/3.\n", "\n", "Even short random sequences show these properties." ] }, { "cell_type": "code", "execution_count": null, "id": "bd4a8452", "metadata": {}, "outputs": [], "source": [ "typical_trend_by_seq_length(0, 20, 1000, 20)" ] }, { "cell_type": "code", "execution_count": null, "id": "71fafb98", "metadata": { "scrolled": true }, "outputs": [], "source": [ "typical_trend_by_seq_length(-1, 20, 1000, 20)" ] }, { "cell_type": "markdown", "id": "dd2c3784", "metadata": {}, "source": [ "Sequences with as few as 10 random floats have first and last trends with means of around 2/3 and 1/3.\n", "\n", "For perspective, take a look at the longest trend in short sequences." ] }, { "cell_type": "code", "execution_count": null, "id": "f5ceeb79", "metadata": {}, "outputs": [], "source": [ "def typical_longest_trend_by_seq_length(max, trials, points):\n", " interval = max // points\n", " typicals = {}\n", " for nrands in range(1, max, interval):\n", " typicals[nrands] = typical_longest_trend(nrands, trials)\n", " return typicals" ] }, { "cell_type": "code", "execution_count": null, "id": "751a1c16", "metadata": {}, "outputs": [], "source": [ "typical_longest_trend_by_seq_length(20, 1000, 20)" ] }, { "cell_type": "markdown", "id": "c98dc2d8", "metadata": {}, "source": [ "## A Bigger Picture: All Trends in a Trendlist." ] }, { "cell_type": "markdown", "id": "f27f5f51", "metadata": {}, "source": [ "With this perspective, let's look at all the Trends in a Trendlist." ] }, { "cell_type": "markdown", "id": "de9bc8ae", "metadata": {}, "source": [ "### Peeking at a very large trendlist" ] }, { "cell_type": "markdown", "id": "dcaee328", "metadata": {}, "source": [ "### Peeking at a very large trendlist" ] }, { "cell_type": "markdown", "id": "ee0b8c4a", "metadata": {}, "source": [ "There are only about $ln(sequence_length)$ Trends, so you could even take a very long sequence, say a million floats,\n", "and just take a quick peek. Remember, decomposing a sequence of a million floats only takes about 4 seconds!\n", "\n", "By running the cell below, repeatedly, you can see a lot of variation. (Each run is only a single trial.)" ] }, { "cell_type": "code", "execution_count": null, "id": "16fcaf25", "metadata": { "scrolled": true }, "outputs": [], "source": [ "seq = [random() for _ in range(1_000_000)]\n", "trendlist = trends.decompose(seq)[0]\n", "trendlist" ] }, { "cell_type": "markdown", "id": "b35410a1", "metadata": {}, "source": [ "The middle trends tend to be long, with means near 1/2, with lengths shortening sharply near each end, \n", "usually accompanied by big changes in the means." ] }, { "cell_type": "markdown", "id": "d6996f23", "metadata": {}, "source": [ "### A more conventional summary" ] }, { "cell_type": "markdown", "id": "54fe8001", "metadata": {}, "source": [ "So far, you've summarized this by looking at a \"typical\" trend from each end and a \"typical\" longest trend.\n", "Another kind of summary is the conventional mean/variance pair.\n", "The `trends` package supplies a pair of methods, `Trendlist.lengths()` and `Trendlist.averages()` that make this easy.\n", "You will also want `mean()` and `variance()` from Python's `statistics` package." ] }, { "cell_type": "code", "execution_count": null, "id": "c0afbca3", "metadata": {}, "outputs": [], "source": [ "import statistics\n", "import collections\n", "\n", "import trends\n", "\n", "def basic_stats(lst):\n", " Stats = collections.namedtuple(\"Stats\", \"mean variance\")\n", " return Stats(mean = statistics.mean(lst), variance = statistics.variance(lst))" ] }, { "cell_type": "code", "execution_count": null, "id": "86f5825a", "metadata": {}, "outputs": [], "source": [ " \n", "def trendlist_stats(nrands)\n", " seq = [random() for _ in range(1_000_000)]\n", " trendlist = trends.decompose(seq)[0]\n", " trend_lengths = trendlist.lengths()\n", " trend_aves = trendlist.averages()\n", " lengths = Stats(mean=statistics.mean(trend_lengths),\n", " variance=statistics.variance(trend_lengths))\n", " aves = Stats(mean=statistics.mean(trend_aves),\n", " variance=statistics.variance(trend_aves))\n", " return (lengths, aves)" ] }, { "cell_type": "code", "execution_count": null, "id": "105c0cc8", "metadata": {}, "outputs": [], "source": [ "trendlist_stats(100_000)" ] }, { "cell_type": "markdown", "id": "ef38a2c2", "metadata": {}, "source": [ "## How Many Hiccups In a Sequence?" ] }, { "cell_type": "markdown", "id": "3516cdb1", "metadata": {}, "source": [ "You know the average number of hiccups in a random sequence will be about $ln(N)$\n", "\n", "How much will that vary? You'll want a couple of functions from Python's standard `statistics` package for that:\n", "`mean()` and `variance()`" ] }, { "cell_type": "code", "execution_count": null, "id": "0b6b1e25", "metadata": {}, "outputs": [], "source": [ "from statistics import mean, variance\n", "import collections\n", "\n", "def basic_stats(lst):\n", " Stats = collections.namedtuple(\"Stats\", \"mean variance\")\n", " return Stats(mean = statistics.mean(lst), variance = statistics.variance(lst))" ] }, { "cell_type": "code", "execution_count": null, "id": "04b2ec7b", "metadata": {}, "outputs": [], "source": [ "def trendlist_hic_stats(nrands, trials):\n", " trendlist_hiccups = []\n", " for _ in range(trials):\n", " seq = [random() for _ in range(nrands)]\n", " trendlist_hiccups.append(len(trends.decompose(seq)[0])-1)\n", " return basic_stats(trendlist_hiccups)" ] }, { "cell_type": "code", "execution_count": null, "id": "322905fa", "metadata": {}, "outputs": [], "source": [ "trendlist_hic_stats(100_000, 1000)" ] }, { "cell_type": "code", "execution_count": null, "id": "19a6d0d0", "metadata": {}, "outputs": [], "source": [ "def trend_count(nrands, trials):\n", " return [ntrends(nrands) for _ in range(trials)]" ] }, { "cell_type": "code", "execution_count": null, "id": "33487a4d", "metadata": {}, "outputs": [], "source": [ "trend_count(1000, 10)" ] }, { "cell_type": "code", "execution_count": null, "id": "41b3b643", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 10))\n", "plt.hist(trend_count(1000, 1000))\n", "plt.title(\"How Many Trends?\")\n", "plt.xlabel(\"Number of Trends\")\n", "plt.ylabel(\"Count\")" ] }, { "cell_type": "markdown", "id": "13cc4c76", "metadata": {}, "source": [ "Not bad, but looks log-normal-ish. Try a second time, using the logs of the counts." ] }, { "cell_type": "code", "execution_count": null, "id": "d876366a", "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "log_trend_counts = numpy.log(trend_count(1_000, 10_000))\n", "plt.hist(log_trend_counts)\n", "plt.title(\"How Many Trends?\")\n", "plt.xlabel(\"Ln(Number of Trends)\")\n", "plt.ylabel(\"Count\")" ] }, { "cell_type": "markdown", "id": "3290dd54", "metadata": {}, "source": [ "The normality test from scipy" ] }, { "cell_type": "code", "execution_count": null, "id": "eff9109c", "metadata": {}, "outputs": [], "source": [ "import scipy\n", "# scipy normality test\n", "stat, p = scipy.stats.normaltest(log_trend_counts)\n", "print('Statistics=%.3f, p=%.3f' % (stat, p))\n", "# interpret\n", "alpha = 0.05\n", "if p > alpha:\n", "\tprint('Sample looks Gaussian (fail to reject H0)')\n", "else:\n", " print('Sample does not look Gaussian (reject H0)')" ] }, { "cell_type": "code", "execution_count": null, "id": "14018e62", "metadata": {}, "outputs": [], "source": [ "from statsmodels.graphics.gofplots import qqplot\n", "hiccups = np.array(trend_count(10_000, 10_000)) - 1\n", "qqplot(np.log(hiccups), line='s')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }