{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Anomaly Detection and Explanation App\n", "\n", "Overview:\n", "\n", "This application is designed to identify anomalies or outliers in datasets, leveraging the strengths of advanced Python scripting for data analysis combined with the interpretative power of a locally hosted large language model (LLM). By separating the data analysis and narrative explanation processes, the app effectively addresses the limitations commonly associated with LLMs handling raw tabular data.\n", "\n", "At its core, the app first employs sophisticated Python algorithms to sift through the dataset, identifying statistical outliers based on predefined criteria. Once these outliers are identified, the relevant insights are distilled into a structured format and passed to the LLM. The LLM then acts not as a data analyst but as a sophisticated interpreter, generating detailed, understandable narratives that explain the significance of these outliers in a context meaningful to business leaders.\n", "\n", "Key Features:\n", "\n", "1. Anomaly Detection: The app employs sophisticated data analysis techniques to identify anomalies or outliers in the provided dataset. It takes into account various dimensions and measures specified by the user to accurately pinpoint unusual patterns or deviations from the norm.\n", "2. Intelligent Explanation: Once the anomalies are detected, the app utilizes a locally hosted LLM to generate clear and concise explanations for each identified outlier. The LLM acts as an interpreter, providing insights into the potential causes or factors contributing to the anomalies without directly interacting with the tabular data.\n", "3. Customizable Parameters: Users can easily customize the app's parameters to suit their specific needs. 
They can define the dimensions, measures, summarization methods, coverage thresholds, and other relevant settings to tailor the anomaly detection process to their unique business requirements.\n", "4. Monthly Trend Analysis: In addition to anomaly detection, the app offers a monthly trend analysis feature. It generates narratives that highlight the top increases and decreases for each dimension over time, providing valuable insights into the temporal patterns within the data.\n", "5. Narrative Enhancement: The app integrates with a language model to enhance the generated narratives. It can provide more detailed overviews, highlight critical outliers or anomalies, and present the information in a clear and concise manner suitable for executive leadership.\n", "\n", "Benefits for Businesses and Leadership Teams:\n", "\n", "1. Data-Driven Decision Making: By identifying and explaining anomalies in the data, the app enables businesses to make more accurate and data-driven decisions. Leadership teams can rely on the app's insights to identify potential issues, opportunities, or areas that require further investigation.\n", "2. Time and Resource Efficiency: Manually analyzing large datasets for anomalies can be time-consuming and resource-intensive. The app automates the anomaly detection process, saving valuable time and allowing teams to focus on strategic tasks and decision-making.\n", "3. Improved Risk Management: Anomalies in data can often indicate potential risks or issues that may impact the business. By proactively identifying and understanding these anomalies, leadership teams can take timely actions to mitigate risks and prevent potential losses.\n", "4. Enhanced Communication and Collaboration: The app's narrative generation feature facilitates effective communication and collaboration among team members. 
The clear and concise explanations of anomalies enable stakeholders from different departments to understand the insights quickly, fostering a shared understanding of the data and its implications.\n", "5. Competitive Advantage: By leveraging advanced data analysis techniques and LLMs, businesses can gain a competitive edge in their respective industries. The app's ability to uncover hidden patterns and provide meaningful explanations can lead to data-driven innovations and improved decision-making processes.\n", "\n", "The Data Anomaly Detection and Explanation App is a powerful tool that empowers businesses to harness the full potential of their data. By streamlining the process of data analysis and report generation, it enhances the overall accessibility of data-driven insights for strategic business use. The application demystifies complex datasets, enabling leadership teams to take confident, informed actions based on robust, AI-enhanced analytics. It not only saves time and resources but also facilitates effective communication and collaboration, ultimately driving business growth and competitive advantage.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import datetime as dt\n", "from typing import Optional\n", "import inflect\n", "import requests\n", "import json\n", "\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for generating narratives from data\n", "\n", "def pluralize(word):\n", " engine = inflect.engine()\n", " plural = engine.plural(word)\n", " return(plural)\n", "\n", "def get_descriptive_outliers(\n", " df, \n", " dimension, \n", " measure, \n", " total=None, \n", " summarization=\"sum\", \n", " coverage=0.5, \n", " coverage_limit=5):\n", " \"\"\"\n", " Returns 
descriptive outliers based on a given dataframe, dimension, and measure.\n", " \"\"\"\n", " table = df.groupby(dimension)[measure].agg(\n", " sum if summarization == \"sum\" else \n", " pd.Series.nunique if summarization == \"count\" else \n", " \"mean\"\n", " ).reset_index().sort_values(by=measure, ascending=False)\n", "\n", " if summarization in [\"sum\", \"count\"]:\n", " if total is None:\n", " total = table[measure].sum().round(2)\n", "\n", " table = (\n", " table.assign(\n", " share=lambda x: x[measure]/total)\n", " .assign(cum_share=lambda x: x['share'].cumsum())\n", " .loc[lambda x: (x['cum_share'] >= coverage).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " if df.shape[0] == 1 and table['cum_share'].iloc[0] == 1:\n", " return None\n", "\n", " elif summarization == 'average':\n", " if total is None:\n", " total = table[measure].mean().round(2)\n", "\n", " table = (table\n", " .assign(share = lambda x: abs(x[measure]/total - 1)) # Use absolute deviation from the mean\n", " .assign(abs_share=lambda x: x['share'].abs())\n", " .sort_values(by='abs_share', ascending=False)\n", " .assign(cum_share=lambda x: x['share'].abs().cumsum()/(x['share'].max() - x['share'].min()))\n", " .loc[lambda x: (x['cum_share'] >= coverage*2).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " n_outliers = table.shape[0]\n", " outlier_levels = table[dimension].astype(str).values.tolist()\n", " outlier_values = table[measure].round(1).values.tolist()\n", " outlier_values_p = (table[\"share\"].round(2) * 100).astype(str).add(\"%\").values.tolist()\n", "\n", " output = {\n", " \"n_outliers\": n_outliers,\n", " \"outlier_levels\": outlier_levels,\n", " \"outlier_values\": outlier_values,\n", " \"outlier_values_p\": outlier_values_p\n", " }\n", "\n", " return output\n", "\n", "def narrate_descriptive(\n", " df,\n", " measure = None,\n", " dimensions = None,\n", " summarization = 'sum',\n", " coverage = 0.5,\n", " 
coverage_limit = 5,\n", "    narration_depth = 2,\n", "    template_total = 'Total {measure} across all {pluralize(dimension_one)} is {total}.',\n", "    template_average = 'Average {measure} across all {pluralize(dimension_one)} is {total}.',\n", "    template_outlier = 'Outlying {dimension} by {measure} is {outlier_insight}.',\n", "    template_outlier_multiple = 'Outlying {pluralize(dimension)} by {measure} are {outlier_insight}.',\n", "    template_outlier_l2 = 'In {level_l1}, significant {level_l2} by {measure} is {outlier_insight}.',\n", "    template_outlier_l2_multiple = 'In {level_l1}, significant {pluralize(level_l2)} by {measure} are {outlier_insight}.',\n", "    return_data = False,\n", "    simplify = False\n", "    ):\n", "    \"\"\"\n", "    This function generates a narrative report based on a given data frame and parameters.\n", "    \"\"\"\n", "    # Assert data frame; raise (rather than print) so bad input fails loudly, consistent with the other helpers\n", "    if not isinstance(df, pd.DataFrame):\n", "        raise ValueError('df must be a pandas DataFrame')\n", "    \n", "    if measure is None:\n", "        measure = df.\\\n", "            select_dtypes(include = 'number').\\\n", "            columns[0]\n", "    \n", "    if dimensions is None:\n", "        dimensions = df.\\\n", "            select_dtypes(include = ['object', 'category']).\\\n", "            columns.\\\n", "            values.\\\n", "            tolist()\n", "    \n", "    dimension_one = dimensions[0]\n", "    \n", "    if summarization == 'sum':\n", "        total_raw = df[measure].sum().round(2)\n", "    elif summarization == 'average':\n", "        total_raw = df[measure].mean().round(2)\n", "    elif summarization == 'count':\n", "        total_raw = df[measure].count()\n", "\n", "    total = total_raw\n", "    \n", "    narrative_total = eval(f\"f'{template_total}'\")\n", "    \n", "    narrative = {\n", "        f'Total {measure}': narrative_total\n", "    } \n", "    \n", "    variables = {\n", "        f'Total {measure}': {\n", "            'narrative_total': narrative_total,\n", "            'template_total': template_total,\n", "            'measure': measure,\n", "            'dimension_one': dimension_one,\n", "            'total': total,\n", "            'total_raw': total_raw\n", "        }\n", "    
}\n", "\n", " # High-Level Narrative\n", " for dimension in dimensions:\n", "\n", " output = get_descriptive_outliers(\n", " df = df,\n", " dimension=dimension,\n", " measure=measure,\n", " # we need overall total for average only, in other cases it leads to incorrect output\n", " total = None if summarization in [\"sum\", \"count\"] else total_raw,\n", " summarization = summarization,\n", " coverage = coverage,\n", " coverage_limit = coverage_limit\n", " )\n", "\n", " if output is None:\n", " continue\n", "\n", " # Outputting all to the global env\n", " n_outliers = output['n_outliers']\n", " outlier_levels = output['outlier_levels']\n", " outlier_values = output['outlier_values']\n", " outlier_values_p = output['outlier_values_p']\n", "\n", " if summarization == 'average':\n", " outlier_insight = ', '.join([f\"{outlier_levels} ({outlier_values}, {outlier_values_p} vs average {measure})\" for outlier_levels, outlier_values, outlier_values_p in zip(outlier_levels, outlier_values, outlier_values_p)])\n", " else:\n", " outlier_insight = ', '.join([f\"{outlier_levels} ({outlier_values}, {outlier_values_p})\" for outlier_levels, outlier_values, outlier_values_p in zip(outlier_levels, outlier_values, outlier_values_p)])\n", "\n", " if n_outliers > 1:\n", " template_outlier_final = template_outlier_multiple\n", " template_selected = \"multiple\"\n", " else:\n", " template_outlier_final = template_outlier\n", " template_selected = \"single\"\n", "\n", " narrative_outlier_final = {\n", " f'{dimension} by {measure}': eval(f\"f'{template_outlier_final}'\")\n", " }\n", " \n", " narrative.update(narrative_outlier_final)\n", "\n", " if template_selected == 'single':\n", " variables_l1 = { \n", " f'{dimension} by {measure}': {\n", " 'narrative_outlier_final': narrative_outlier_final, \n", " 'template_outlier': template_outlier, \n", " 'dimension': dimension, \n", " 'measure': measure, \n", " 'outlier_insight': outlier_insight, \n", " 'n_outliers': n_outliers, \n", " 
'outlier_levels': outlier_levels, \n", " 'outlier_values': outlier_values, \n", " 'outlier_values_p': outlier_values_p \n", " }\n", " }\n", "\n", " if template_selected == 'multiple':\n", " variables_l1 = { \n", " f'{dimension} by {measure}': {\n", " 'narrative_outlier_final': narrative_outlier_final, \n", " 'template_outlier_multiple': template_outlier_multiple, \n", " 'dimension': dimension, \n", " 'measure': measure, \n", " 'outlier_insight': outlier_insight, \n", " 'n_outliers': n_outliers, \n", " 'outlier_levels': outlier_levels, \n", " 'outlier_values': outlier_values, \n", " 'outlier_values_p': outlier_values_p \n", " }\n", " }\n", "\n", " variables.update(variables_l1)\n", " \n", " # Output\n", " if return_data == True:\n", " return(variables)\n", " \n", " if simplify == True:\n", " narrative = list(narrative.values())\n", " \n", " return(narrative)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for obtaining the frequency of the data\n", "\n", "def get_frequency(df, date_field=None):\n", " \"\"\"\n", " Get Date Frequency of the Data\n", " \"\"\"\n", " \n", " if not isinstance(df, pd.DataFrame):\n", " raise ValueError(\"'df' must be a pandas DataFrame\")\n", " \n", " if len(df) == 0:\n", " raise ValueError(\"'df' must have at least 1 row\")\n", " \n", " if date_field is None:\n", " date_fields = df.select_dtypes(include=[np.datetime64]).columns if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", " \n", " if date_fields is None:\n", " raise ValueError(\"No date field detected in 'df'\")\n", " elif len(date_fields) > 1:\n", " raise ValueError(\"Multiple date fields detected in 'df', please specify 'date_field'\")\n", " else:\n", " date_field = date_fields[0]\n", " else:\n", " if date_field not in df.columns:\n", " raise ValueError(\"'date_field' must be present in the supplied data frame\")\n", " elif not 
np.issubdtype(df[date_field].dtype, np.datetime64):\n", "            raise ValueError(\"'date_field' must be of datetime type\")\n", "    \n", "    df = df.rename(columns={date_field: \"date_field\"})\n", "    \n", "    # De-duplicate and sort the dates first so the modal gap reflects the true cadence\n", "    # (otherwise repeated dates yield zero-day gaps and unsorted data yields noise)\n", "    est_frequency = df[\"date_field\"].drop_duplicates().sort_values().diff().dt.days.abs().value_counts().idxmax()\n", "    \n", "    if est_frequency > 300:\n", "        frequency = \"year\"\n", "    elif est_frequency > 35:\n", "        frequency = \"quarter\"\n", "    elif est_frequency > 8:\n", "        frequency = \"month\"\n", "    elif est_frequency > 3:\n", "        frequency = \"week\"\n", "    else:\n", "        frequency = \"day\"\n", "    \n", "    return frequency\n", "\n", "def get_py_date(df: pd.DataFrame, frequency: Optional[str] = None):\n", "    \"\"\"\n", "    Calculate the prior year date based on the maximum date in the DataFrame and the given frequency.\n", "    \"\"\"\n", "    # Table must be a pandas DataFrame and have at least one row\n", "    if not isinstance(df, pd.DataFrame):\n", "        raise ValueError(\"df must be a pandas DataFrame\")\n", "    if df.shape[0] == 0:\n", "        raise ValueError(\"df must have at least one row, execution is stopped\")\n", "    \n", "    date_field = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", "    \n", "    if date_field is None:\n", "        raise ValueError(\"Data frame must contain one date column\")\n", "\n", "    # Calculating frequency if not available\n", "    if frequency is None:\n", "        frequency = get_frequency(df)\n", "\n", "    max_date = df[date_field].max()\n", "    max_year = max_date.year\n", "\n", "    if frequency == \"week\":\n", "        df_weekly = (\n", "            df[[date_field]]\n", "            .drop_duplicates()\n", "            .sort_values(by=date_field)\n", "            .assign(week=lambda x: x[date_field].dt.isocalendar().week,\n", "                    year=lambda x: x[date_field].dt.year)\n", "        )\n", "\n", "        max_week = df_weekly.loc[df_weekly[date_field] == max_date, \"week\"].iat[0]\n", "\n", "        py_date = df_weekly.loc[(df_weekly[\"year\"] == max_year - 1) & (df_weekly[\"week\"] == max_week), date_field].values\n", "        
py_date = pd.to_datetime(py_date)\n", "\n", "        if len(py_date) == 0:\n", "            py_date = max_date - pd.DateOffset(years=1)\n", "        else:\n", "            py_date = py_date[0]\n", "    else:\n", "        py_date = max_date - pd.DateOffset(years=1)\n", "\n", "    return py_date.date()\n", "\n", "\n", "def ytd_volume(\n", "    df, \n", "    measure = None, \n", "    date = None, \n", "    summarization = \"sum\", \n", "    current_year = None, \n", "    cy_date = None):\n", "    \"\"\"\n", "    Calculate the year-to-date (YTD) volume of a given measure in a pandas DataFrame.\n", "    \"\"\"\n", "    # Table must be a pandas DataFrame and have at least one row\n", "    if not isinstance(df, pd.DataFrame):\n", "        raise ValueError(\"df must be a pandas DataFrame\")\n", "    if df.shape[0] == 0:\n", "        raise ValueError(\"df must have at least one row, execution is stopped\")\n", "    \n", "    # Summarization Assertion\n", "    if summarization not in [\"sum\", \"count\", \"average\"]:\n", "        raise ValueError(\"summarization must be one of: 'sum', 'count' or 'average'.\")\n", "    \n", "    # Measure, Date and Dimensions Assertion\n", "    if measure is not None:\n", "        if measure not in df.columns:\n", "            raise ValueError(\"measure must be a column in the dataset\")\n", "    else:\n", "        # If measure isn't supplied get the first numerical column from it\n", "        measure = df.select_dtypes(include=[np.number]).columns[0]\n", "    \n", "    # Get Date\n", "    if date is not None:\n", "        if date not in df.columns:\n", "            raise ValueError(\"date must be a column in the dataset\")\n", "        \n", "        df[date] = pd.to_datetime(df[date])\n", "        \n", "        if not pd.api.types.is_datetime64_any_dtype(df[date]):\n", "            raise ValueError(\"'date' must be a date column in the dataset\")\n", "    else:\n", "        # Getting the first date field available\n", "        date = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", "        \n", "        if date is None:\n", "            raise ValueError(\"No date column found in the dataset\")\n", "\n", "    # Current Year's 
Date\n", "    if cy_date is None:\n", "        cy_date = df[date].max()\n", "    else:\n", "        cy_date = pd.to_datetime(cy_date)\n", "    \n", "    # Current year assertion\n", "    if current_year is not None and current_year != cy_date.year:\n", "        try:\n", "            current_year = int(current_year)\n", "        except (TypeError, ValueError):\n", "            raise ValueError(\"current_year argument must be numeric or convertible to numeric like 2022 or '2022'\")\n", "    else:\n", "        current_year = cy_date.year\n", "\n", "    cy_volume = (df.assign(year=df[date].dt.year)\n", "                 .query('year == @current_year and `{0}` <= @cy_date'.format(date))\n", "                 .agg({measure: \"mean\" if summarization == \"average\" else summarization})  # pandas expects 'mean', not 'average'\n", "                 .squeeze()\n", "                 )\n", "    \n", "    cy_volume = cy_volume.round(2)\n", "    \n", "    return cy_volume\n", "\n", "def pytd_volume(\n", "    df, \n", "    measure = None, \n", "    date = None, \n", "    summarization = \"sum\", \n", "    current_year = None, \n", "    py_date = None):\n", "    \"\"\"\n", "    Calculate the previous year-to-date (PYTD) volume of a given measure in a pandas DataFrame.\n", "    \"\"\"\n", "    # Table must be a pandas DataFrame and have at least one row\n", "    if not isinstance(df, pd.DataFrame):\n", "        raise ValueError(\"df must be a pandas DataFrame\")\n", "    if df.shape[0] == 0:\n", "        raise ValueError(\"df must have at least one row, execution is stopped\")\n", "    \n", "    # Summarization Assertion\n", "    if summarization not in [\"sum\", \"count\", \"average\"]:\n", "        raise ValueError(\"summarization must be one of: 'sum', 'count' or 'average'.\")\n", "    \n", "    # Measure, Date and Dimensions Assertion\n", "    if measure is not None:\n", "        if measure not in df.columns:\n", "            raise ValueError(\"measure must be a column in the dataset\")\n", "    else:\n", "        # If measure isn't supplied get the first numerical column from it\n", "        measure = df.select_dtypes(include=[np.number]).columns[0]\n", "    \n", "    # Get Date\n", "    if date is not None:\n", "        if date not in df.columns:\n", "            raise ValueError(\"date must be a column in the dataset\")\n", "        \n", "        df[date] = pd.to_datetime(df[date])\n", 
" \n", "        if not pd.api.types.is_datetime64_any_dtype(df[date]):\n", "            raise ValueError(\"'date' must be a date column in the dataset\")\n", "    else:\n", "        # Getting the first date field available\n", "        date = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", "        \n", "        if date is None:\n", "            raise ValueError(\"No date column found in the dataset\")\n", "\n", "    # Prior Year's Date\n", "    if py_date is None:\n", "        py_date = get_py_date(df)\n", "    else:\n", "        py_date = pd.to_datetime(py_date)\n", "    \n", "    # Previous year derivation; always assign previous_year so the query below cannot hit an undefined name\n", "    if current_year is not None:\n", "        try:\n", "            previous_year = int(current_year) - 1\n", "        except (TypeError, ValueError):\n", "            raise ValueError(\"current_year argument must be numeric or convertible to numeric like 2022 or '2022'\")\n", "    else:\n", "        previous_year = py_date.year\n", "\n", "    py_volume = (df.assign(year=df[date].dt.year)\n", "                 .query('year == @previous_year and `{0}` <= @py_date'.format(date))\n", "                 .agg({measure: \"mean\" if summarization == \"average\" else summarization})  # pandas expects 'mean', not 'average'\n", "                 .squeeze()\n", "                 )\n", "    \n", "    py_volume = py_volume.round(2)\n", "\n", "    return py_volume" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for identifying outliers in the data\n", "\n", "def get_trend_outliers(\n", "    df, \n", "    dimension, \n", "    measure, \n", "    total = None, \n", "    summarization = \"sum\", \n", "    coverage = 0.5, \n", "    coverage_limit = 5):\n", "    \"\"\"\n", "    Returns trend outliers based on a given dataframe, dimension, and measure.\n", "    \"\"\"\n", "    grouped = df.groupby(dimension)\n", "\n", "    # Define a function to apply to each group\n", "    def process_group(group, ytd_volume, pytd_volume, summarization, measure):\n", "        curr_volume = ytd_volume(group, summarization=summarization, measure=measure)\n", "        prev_volume = pytd_volume(group, summarization=summarization, measure=measure)\n", "        
change = curr_volume - prev_volume\n", " change_p = f\"{round(change / prev_volume * 100, 2)}%\"\n", " abs_change = abs(change)\n", " trend = \"increase\" if change > 0 else \"decrease\"\n", " \n", " output = pd.Series({\n", " \"curr_volume\": curr_volume,\n", " \"prev_volume\": prev_volume,\n", " \"change\": change,\n", " \"change_p\": change_p,\n", " \"abs_change\": abs_change,\n", " \"trend\": trend,\n", " })\n", " \n", " return output\n", "\n", " # Apply the function to each group and create a new DataFrame\n", " table = grouped.apply(process_group, ytd_volume, pytd_volume, summarization, measure)\n", "\n", " # Reset the index and sort by abs_change\n", " table = table.reset_index().sort_values(by=\"abs_change\", ascending=False)\n", "\n", " if summarization in [\"sum\", \"count\"]:\n", " if total is None:\n", " total = df[measure].sum().round(2)\n", "\n", " table = (\n", " table.assign(share=lambda x: x['abs_change'] / x['abs_change'].sum())\n", " .assign(cum_share=lambda x: x['share'].cumsum())\n", " .assign(lag_cum_share=lambda x: x['cum_share'].shift(fill_value=False))\n", " ).reset_index()\n", "\n", " table = table[table['lag_cum_share'] < coverage]\n", " table = table.head(coverage_limit)\n", "\n", " if df.shape[0] == 1 and table['cum_share'].iloc[0] == 1:\n", " return None\n", "\n", " elif summarization == 'average':\n", " if total is None:\n", " total = table[measure].mean().round(2)\n", "\n", " table = (table\n", " .assign(share = lambda x: x[measure]/total - 1)\n", " .assign(abs_share=lambda x: x['share'].abs())\n", " .sort_values(by='abs_share', ascending=False)\n", " .assign(cum_share=lambda x: x['share'].abs().cumsum()/(x['share'].max() - x['share'].min()))\n", " .loc[lambda x: (x['cum_share'] >= coverage*2).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " n_outliers = table.shape[0]\n", " outlier_levels = table[dimension].astype(str).values.tolist()\n", " outlier_values = 
table[measure].round(1).values.tolist()\n", " outlier_values_p = (table[\"share\"].round(2) * 100).astype(str).add(\"%\").values.tolist()\n", "\n", " output = {\n", " \"n_outliers\": n_outliers,\n", " \"outlier_levels\": outlier_levels,\n", " \"outlier_values\": outlier_values,\n", " \"outlier_values_p\": outlier_values_p\n", " }\n", "\n", " return output\n", "\n", "\n", "# Monthly Trend Analysis\n", "\n", "def monthly_trend(df, measure, dimensions, date_column='Date', summarization='sum', top_n=3):\n", " # Ensure the date column is in datetime format\n", " df[date_column] = pd.to_datetime(df[date_column])\n", " df['Year'] = df[date_column].dt.year\n", " df['Month'] = df[date_column].dt.month\n", "\n", " # Group by dimensions, Year, and Month\n", " group_by_columns = dimensions + ['Year', 'Month']\n", " if summarization == 'sum':\n", " summary_func = np.sum\n", " elif summarization == 'average':\n", " summary_func = np.mean\n", " elif summarization == 'count':\n", " summary_func = 'count'\n", " else:\n", " raise ValueError(\"Invalid summarization method. 
Choose from 'sum', 'average', 'count'.\")\n", "\n", " summary = df.groupby(group_by_columns)[measure].agg(summary_func).reset_index(name='Measure')\n", " summary.sort_values(by=group_by_columns, inplace=True)\n", " summary['Change'] = summary.groupby(dimensions)['Measure'].diff().fillna(0)\n", " summary['Percentage_Change'] = summary.groupby(dimensions)['Measure'].pct_change().fillna(0) * 100\n", "\n", " narrative = {}\n", " for dimension in dimensions:\n", " dimension_summary = summary.groupby([dimension, 'Year', 'Month']).agg({'Change': 'sum', 'Percentage_Change': 'mean'}).reset_index()\n", " # Top N positive changes\n", " top_increases = dimension_summary.nlargest(top_n, 'Change')\n", " # Top N negative changes\n", " top_decreases = dimension_summary.nsmallest(top_n, 'Change')\n", "\n", " narrative[dimension + ' Increases'] = '; '.join([f\"{row[dimension]}, {row['Year']}-{row['Month']:02d}: Change={row['Change']:.2f}, ({row['Percentage_Change']:.2f}%)\" for index, row in top_increases.iterrows()])\n", " narrative[dimension + ' Decreases'] = '; '.join([f\"{row[dimension]}, {row['Year']}-{row['Month']:02d}: Change={row['Change']:.2f}, ({row['Percentage_Change']:.2f}%)\" for index, row in top_decreases.iterrows()])\n", "\n", " # Overall summary\n", " total_increase_months = summary[summary['Change'] > 0].shape[0]\n", " total_decrease_months = summary[summary['Change'] < 0].shape[0]\n", " narrative['Overall Summary'] = f\"Total months with increase: {total_increase_months}, Total months with decrease: {total_decrease_months}\"\n", "\n", " return narrative" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Configure the OpenAI API client with the base URL for Ollama\n", "from openai import OpenAI\n", "openai_api_key = 'ollama'\n", "\n", "# Instantiate the OpenAI client with the base_url for Ollama\n", "client = OpenAI(\n", " api_key=openai_api_key, \n", " base_url='http://localhost:11434/v1' # Ollama's base URL\n", ")\n", 
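"\n", "# Hedged usage sketch (assumptions: an Ollama server is running on localhost:11434\n", "# and a 'llama3:latest' model has already been pulled, e.g. via `ollama pull llama3`).\n", "# Uncomment to smoke-test the connection before relying on the helpers below:\n", "# print(client.chat.completions.create(\n", "#     messages=[{\"role\": \"user\", \"content\": \"ping\"}],\n", "#     model=\"llama3:latest\",\n", "# ).choices[0].message.content)\n",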
"\n", "def gpt_get_completions(\n", "    prompt,\n", "    openai_api_key=openai_api_key\n", "):\n", "\n", "    # Set up the OpenAI API client with the messages and model information\n", "    chat_completion = client.chat.completions.create(\n", "        messages=[\n", "            {\"role\": \"system\", \"content\": \"You are an expert business intelligence data analyst.\"},\n", "            {\"role\": \"user\", \"content\": prompt}\n", "        ],\n", "        model=\"llama3:latest\",  # Replace with any chat model you have pulled into Ollama\n", "    )\n", "\n", "    # Return the content of the first message in the completions\n", "    return chat_completion.choices[0].message.content\n", "\n", "def enhance_narrative(\n", "    narrative,\n", "    openai_api_key=openai_api_key\n", "):\n", "    prompt = f'Provide a detailed overview of the information provided to executive leadership and make sure to highlight any outliers or anomalies in the data: \"{narrative}\"'\n", "    output = gpt_get_completions(prompt, openai_api_key=openai_api_key)\n", "    return output\n", "\n", "def translate_narrative(\n", "    narrative,\n", "    language,\n", "    openai_api_key=openai_api_key\n", "):\n", "    prompt = f'Using professional language, translate the following text to {language}: \"{narrative}\"'\n", "    output = gpt_get_completions(prompt, openai_api_key=openai_api_key)\n", "    return output\n", "\n", "def summarize_narrative(\n", "    narrative,\n", "    openai_api_key=openai_api_key\n", "):\n", "    prompt = f'Summarize the following narrative to make it shorter: \"{narrative}\"'\n", "    output = gpt_get_completions(prompt, openai_api_key=openai_api_key)\n", "    return output" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Read the dataset into a pandas DataFrame\n", "filename = 'sales.csv'\n", "\n", "def read_data():\n", "    return pd.read_csv(filename, encoding='latin-1', keep_default_na=False)\n", "\n", "data = read_data()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>Order ID</th><th>Date</th><th>Region</th><th>Product</th><th>Store</th><th>Promotion</th><th>Price</th><th>Quantity</th><th>Sales</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>QR10276</td><td>1/30/2019</td><td>Europe</td><td>Tools</td><td>Big</td><td>1</td><td>89.6</td><td>28</td><td>2007.04</td></tr>\n", "    <tr><th>1</th><td>QR15245</td><td>7/26/2020</td><td>Asia</td><td>Tools</td><td>Big</td><td>0</td><td>268.5</td><td>7</td><td>1879.50</td></tr>\n", "    <tr><th>2</th><td>QR13938</td><td>3/5/2020</td><td>Europe</td><td>Home</td><td>Big</td><td>0</td><td>223.3</td><td>18</td><td>4019.40</td></tr>\n", "    <tr><th>3</th><td>QR15934</td><td>10/8/2020</td><td>South America</td><td>Clothing</td><td>Big</td><td>0</td><td>210.4</td><td>27</td><td>5680.80</td></tr>\n", "    <tr><th>4</th><td>QR10963</td><td>4/15/2019</td><td>Europe</td><td>Tools</td><td>Big</td><td>0</td><td>193.1</td><td>22</td><td>4248.20</td></tr>\n", "  </tbody>\n", "</table>\n", "