{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Anomaly Detection and Explanation App\n", "\n", "Overview:\n", "\n", "This application is designed to identify anomalies or outliers in datasets, leveraging the strengths of advanced Python scripting for data analysis combined with the interpretative power of a locally hosted large language model (LLM). By separating the data analysis and narrative explanation processes, the app effectively addresses the limitations commonly associated with LLMs handling raw tabular data.\n", "\n", "At its core, the app first employs sophisticated Python algorithms to sift through the dataset, identifying statistical outliers based on predefined criteria. Once these outliers are identified, the relevant insights are distilled into a structured format and passed to the LLM. The LLM then acts not as a data analyst but as a sophisticated interpreter, generating detailed, understandable narratives that explain the significance of these outliers in a context meaningful to business leaders.\n", "\n", "Key Features:\n", "\n", "1. Anomaly Detection: The app employs sophisticated data analysis techniques to identify anomalies or outliers in the provided dataset. It takes into account various dimensions and measures specified by the user to accurately pinpoint unusual patterns or deviations from the norm.\n", "2. Intelligent Explanation: Once the anomalies are detected, the app utilizes a locally hosted LLM to generate clear and concise explanations for each identified outlier. The LLM acts as an interpreter, providing insights into the potential causes or factors contributing to the anomalies without directly interacting with the tabular data.\n", "3. Customizable Parameters: Users can easily customize the app's parameters to suit their specific needs. They can define the dimensions, measures, summarization methods, coverage thresholds, and other relevant settings to tailor the anomaly detection process to their unique business requirements.\n", "4. Monthly Trend Analysis: In addition to anomaly detection, the app offers a monthly trend analysis feature. It generates narratives that highlight the top increases and decreases for each dimension over time, providing valuable insights into the temporal patterns within the data.\n", "5. Narrative Enhancement: The app integrates with a language model to enhance the generated narratives. It can provide more detailed overviews, highlight critical outliers or anomalies, and present the information in a clear and concise manner suitable for executive leadership.\n", "\n", "Benefits for Businesses and Leadership Teams:\n", "\n", "1. Data-Driven Decision Making: By identifying and explaining anomalies in the data, the app enables businesses to make more accurate and data-driven decisions. Leadership teams can rely on the app's insights to identify potential issues, opportunities, or areas that require further investigation.\n", "2. Time and Resource Efficiency: Manually analyzing large datasets for anomalies can be time-consuming and resource-intensive. The app automates the anomaly detection process, saving valuable time and allowing teams to focus on strategic tasks and decision-making.\n", "3. Improved Risk Management: Anomalies in data can often indicate potential risks or issues that may impact the business. By proactively identifying and understanding these anomalies, leadership teams can take timely actions to mitigate risks and prevent potential losses.\n", "4. 
Enhanced Communication and Collaboration: The app's narrative generation feature facilitates effective communication and collaboration among team members. The clear and concise explanations of anomalies enable stakeholders from different departments to understand the insights quickly, fostering a shared understanding of the data and its implications.\n", "5. Competitive Advantage: By leveraging advanced data analysis techniques and LLMs, businesses can gain a competitive edge in their respective industries. The app's ability to uncover hidden patterns and provide meaningful explanations can lead to data-driven innovations and improved decision-making processes.\n", "\n", "The Data Anomaly Detection and Explanation App is a powerful tool that empowers businesses to harness the full potential of their data. By streamlining the process of data analysis and report generation, it enhances the overall accessibility of data-driven insights for strategic business use. The application demystifies complex datasets, enabling leadership teams to take confident, informed actions based on robust, AI-enhanced analytics. It not only saves time and resources but also facilitates effective communication and collaboration, ultimately driving business growth and competitive advantage.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import datetime as dt\n", "from typing import Optional\n", "import inflect\n", "import requests\n", "import json\n", "\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for generating narratives from data\n", "\n", "def pluralize(word):\n", " engine = inflect.engine()\n", " plural = engine.plural(word)\n", " return(plural)\n", "\n", "def get_descriptive_outliers(\n", " df, \n", " dimension, \n", " measure, \n", " total=None, \n", " summarization=\"sum\", \n", " coverage=0.5, \n", " coverage_limit=5):\n", " \"\"\"\n", " Returns descriptive outliers based on a given dataframe, dimension, and measure.\n", " \"\"\"\n", " table = df.groupby(dimension)[measure].agg(\n", " sum if summarization == \"sum\" else \n", " pd.Series.nunique if summarization == \"count\" else \n", " \"mean\"\n", " ).reset_index().sort_values(by=measure, ascending=False)\n", "\n", " if summarization in [\"sum\", \"count\"]:\n", " if total is None:\n", " total = table[measure].sum().round(2)\n", "\n", " table = (\n", " table.assign(\n", " share=lambda x: x[measure]/total)\n", " .assign(cum_share=lambda x: x['share'].cumsum())\n", " .loc[lambda x: (x['cum_share'] >= coverage).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " if df.shape[0] == 1 and table['cum_share'].iloc[0] == 1:\n", " return None\n", "\n", " elif summarization == 'average':\n", " if total is None:\n", " total = table[measure].mean().round(2)\n", "\n", " table = (table\n", " .assign(share = lambda x: abs(x[measure]/total - 1)) # Use absolute deviation from the mean\n", " .assign(abs_share=lambda x: x['share'].abs())\n", " .sort_values(by='abs_share', ascending=False)\n", " .assign(cum_share=lambda x: x['share'].abs().cumsum()/(x['share'].max() - x['share'].min()))\n", " .loc[lambda x: (x['cum_share'] >= coverage*2).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " n_outliers = 
table.shape[0]\n", " outlier_levels = table[dimension].astype(str).values.tolist()\n", " outlier_values = table[measure].round(1).values.tolist()\n", " outlier_values_p = (table[\"share\"].round(2) * 100).astype(str).add(\"%\").values.tolist()\n", "\n", " output = {\n", " \"n_outliers\": n_outliers,\n", " \"outlier_levels\": outlier_levels,\n", " \"outlier_values\": outlier_values,\n", " \"outlier_values_p\": outlier_values_p\n", " }\n", "\n", " return output\n", "\n", "def narrate_descriptive(\n", " df,\n", " measure = None,\n", " dimensions = None,\n", " summarization = 'sum',\n", " coverage = 0.5,\n", " coverage_limit = 5,\n", " narration_depth = 2,\n", " template_total = 'Total {measure} across all {pluralize(dimension_one)} is {total}.',\n", " template_average = 'Average {measure} across all {pluralize(dimension_one)} is {total}.',\n", " template_outlier = 'Outlying {dimension} by {measure} is {outlier_insight}.',\n", " template_outlier_multiple = 'Outlying {pluralize(dimension)} by {measure} are {outlier_insight}.',\n", " template_outlier_l2 = 'In {level_l1}, significant {level_l2} by {measure} is {outlier_insight}.',\n", " template_outlier_l2_multiple = 'In {level_l1}, significant {pluralize(level_l2)} by {measure} are {outlier_insight}.',\n", " return_data = False,\n", " simplify = False\n", " ):\n", " \"\"\"\n", " This function generates a narrative report based on a given data frame and parameters.\n", " \"\"\"\n", " # Assert data frame\n", " if not isinstance(df, pd.DataFrame):\n", " print('df must be a pandas DataFrame')\n", " return\n", " \n", " if isinstance(measure, type(None)):\n", " measure = df.\\\n", " select_dtypes(include = 'number').\\\n", " columns[0]\n", " \n", " if isinstance(dimensions, type(None)):\n", " dimensions = df.\\\n", " select_dtypes(include = ['object', 'category']).\\\n", " columns.\\\n", " values.\\\n", " tolist()\n", " \n", " dimension_one = dimensions[0]\n", " \n", " if summarization == 'sum':\n", " total_raw = df[measure].sum().round(2)\n", " elif summarization == 'average':\n", " total_raw = df[measure].mean().round(2)\n", " elif summarization == 'count':\n", " total_raw = df[measure].count()\n", "\n", " total = total_raw\n", " \n", " narrative_total = eval(f\"f'{template_total}'\")\n", " \n", " narrative = {\n", " f'Total {measure}': narrative_total\n", " } \n", " \n", " variables = {\n", " f'Total {measure}': {\n", " 'narrative_total': narrative_total,\n", " 'template_total': template_total,\n", " 'measure': measure,\n", " 'dimension_one': dimension_one,\n", " 'total': total,\n", " 'total_raw': total_raw\n", " }\n", " }\n", "\n", " # High-Level Narrative\n", " for dimension in dimensions:\n", "\n", " output = get_descriptive_outliers(\n", " df = df,\n", " dimension=dimension,\n", " measure=measure,\n", " # we need overall total for average only, in other cases it leads to incorrect output\n", " total = None if summarization in [\"sum\", \"count\"] else total_raw,\n", " summarization = summarization,\n", " coverage = coverage,\n", " coverage_limit = coverage_limit\n", " )\n", "\n", " if output is None:\n", " continue\n", "\n", " # Outputting all to the global env\n", " n_outliers = output['n_outliers']\n", " outlier_levels = output['outlier_levels']\n", " outlier_values = output['outlier_values']\n", " outlier_values_p = output['outlier_values_p']\n", "\n", " if summarization == 'average':\n", " outlier_insight = ', '.join([f\"{outlier_levels} ({outlier_values}, {outlier_values_p} vs average {measure})\" for outlier_levels, outlier_values, 
outlier_values_p in zip(outlier_levels, outlier_values, outlier_values_p)])\n", " else:\n", " outlier_insight = ', '.join([f\"{outlier_levels} ({outlier_values}, {outlier_values_p})\" for outlier_levels, outlier_values, outlier_values_p in zip(outlier_levels, outlier_values, outlier_values_p)])\n", "\n", " if n_outliers > 1:\n", " template_outlier_final = template_outlier_multiple\n", " template_selected = \"multiple\"\n", " else:\n", " template_outlier_final = template_outlier\n", " template_selected = \"single\"\n", "\n", " narrative_outlier_final = {\n", " f'{dimension} by {measure}': eval(f\"f'{template_outlier_final}'\")\n", " }\n", " \n", " narrative.update(narrative_outlier_final)\n", "\n", " if template_selected == 'single':\n", " variables_l1 = { \n", " f'{dimension} by {measure}': {\n", " 'narrative_outlier_final': narrative_outlier_final, \n", " 'template_outlier': template_outlier, \n", " 'dimension': dimension, \n", " 'measure': measure, \n", " 'outlier_insight': outlier_insight, \n", " 'n_outliers': n_outliers, \n", " 'outlier_levels': outlier_levels, \n", " 'outlier_values': outlier_values, \n", " 'outlier_values_p': outlier_values_p \n", " }\n", " }\n", "\n", " if template_selected == 'multiple':\n", " variables_l1 = { \n", " f'{dimension} by {measure}': {\n", " 'narrative_outlier_final': narrative_outlier_final, \n", " 'template_outlier_multiple': template_outlier_multiple, \n", " 'dimension': dimension, \n", " 'measure': measure, \n", " 'outlier_insight': outlier_insight, \n", " 'n_outliers': n_outliers, \n", " 'outlier_levels': outlier_levels, \n", " 'outlier_values': outlier_values, \n", " 'outlier_values_p': outlier_values_p \n", " }\n", " }\n", "\n", " variables.update(variables_l1)\n", " \n", " # Output\n", " if return_data == True:\n", " return(variables)\n", " \n", " if simplify == True:\n", " narrative = list(narrative.values())\n", " \n", " return(narrative)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for obtaining the frequency of the data\n", "\n", "def get_frequency(df, date_field=None):\n", " \"\"\"\n", " Get Date Frequency of the Data\n", " \"\"\"\n", " \n", " if not isinstance(df, pd.DataFrame):\n", " raise ValueError(\"'df' must be a pandas DataFrame\")\n", " \n", " if len(df) == 0:\n", " raise ValueError(\"'df' must have at least 1 row\")\n", " \n", " if date_field is None:\n", " date_fields = df.select_dtypes(include=[np.datetime64]).columns if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", " \n", " if date_fields is None:\n", " raise ValueError(\"No date field detected in 'df'\")\n", " elif len(date_fields) > 1:\n", " raise ValueError(\"Multiple date fields detected in 'df', please specify 'date_field'\")\n", " else:\n", " date_field = date_fields[0]\n", " else:\n", " if date_field not in df.columns:\n", " raise ValueError(\"'date_field' must be present in the supplied data frame\")\n", " elif not np.issubdtype(df[date_field].dtype, np.datetime64):\n", " raise ValueError(\"'date_field' must be of datetime type\")\n", " \n", " df = df.rename(columns={date_field: \"date_field\"})\n", " \n", " est_frequency = df[\"date_field\"].diff().dt.days.abs().value_counts().idxmax()\n", " \n", " if est_frequency > 300:\n", " frequency = \"year\"\n", " elif est_frequency > 35:\n", " frequency = \"quarter\"\n", " elif est_frequency > 8:\n", " frequency = \"month\"\n", " elif est_frequency > 3:\n", " frequency = \"week\"\n", " else:\n", " 
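# a modal day-gap of three days or fewer is treated as daily data\n", " 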
frequency = \"day\"\n", " \n", " return frequency\n", "\n", "def get_py_date(df: pd.DataFrame, frequency: Optional[str] = None):\n", " \"\"\"\n", " Calculate the prior year date based on the maximum date in the DataFrame and the given frequency.\n", " \"\"\"\n", " # Table must be a pandas DataFrame and have at least one row\n", " if not isinstance(df, pd.DataFrame):\n", " raise ValueError(\"df must be a pandas DataFrame\")\n", " if df.shape[0] == 0:\n", " raise ValueError(\"df must have at least one row, execution is stopped\")\n", " \n", " date_field = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", " \n", " if date_field is None:\n", " raise ValueError(\"Data frame must contain one date column\")\n", "\n", " # Calculating frequency if not available\n", " if frequency is None:\n", " frequency = get_frequency(df)\n", "\n", " max_date = df[date_field].max()\n", " max_year = max_date.year\n", "\n", " if frequency == \"week\":\n", " df_weekly = (\n", " df[[date_field]]\n", " .drop_duplicates()\n", " .sort_values(by=date_field)\n", " .assign(week=lambda x: x[date_field].dt.isocalendar().week,\n", " year=lambda x: x[date_field].dt.year)\n", " )\n", "\n", " max_week = df_weekly.loc[df_weekly[date_field] == max_date, \"week\"].iat[0]\n", "\n", " py_date = df_weekly.loc[(df_weekly[\"year\"] == max_year - 1) & (df_weekly[\"week\"] == max_week), date_field].values\n", " py_date = pd.to_datetime(py_date)\n", "\n", " if len(py_date) == 0:\n", " py_date = max_date - pd.DateOffset(years=1)\n", " else:\n", " py_date = py_date[0]\n", " else:\n", " py_date = max_date - pd.DateOffset(years=1)\n", "\n", " return py_date.date()\n", "\n", "\n", "def ytd_volume(\n", " df, \n", " measure = None, \n", " date = None, \n", " summarization = \"sum\", \n", " current_year = None, \n", " cy_date = None):\n", " \"\"\"\n", " Calculate the year-to-date (YTD) volume of a given measure in a pandas DataFrame.\n", " \"\"\"\n", " # Table must be a pandas DataFrame and have at least one row\n", " if not isinstance(df, pd.DataFrame):\n", " raise ValueError(\"df must be a pandas DataFrame\")\n", " if df.shape[0] == 0:\n", " raise ValueError(\"df must have at least one row, execution is stopped\")\n", " \n", " # Summarization Assertion\n", " if summarization not in [\"sum\", \"count\", \"average\"]:\n", " raise ValueError(\"summarization must of be one of: 'sum', 'count' or 'mean'.\")\n", " \n", " # Measure, Date and Dimensions Assertion\n", " if measure is not None:\n", " if measure not in df.columns:\n", " raise ValueError(\"measure must a column in the dataset\")\n", " else:\n", " # If measure isn't supplied get the first numerical column from it\n", " measure = df.select_dtypes(include=[np.number]).columns[0]\n", " \n", " # Get Date\n", " if date is not None:\n", " if date not in df.columns:\n", " raise ValueError(\"date must a column in the dataset\")\n", " \n", " df[date] = pd.to_datetime(df[date])\n", " \n", " if not pd.api.types.is_datetime64_any_dtype(df[date]):\n", " raise ValueError(\"'date' must be a date column in the dataset\")\n", " else:\n", " # Getting the first date field available\n", " date = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", " \n", " if date is None:\n", " raise ValueError(\"No date column found in the dataset\")\n", "\n", " # Current Year's Date\n", " if cy_date is None:\n", " cy_date = df[date].max()\n", " else:\n", " cy_date = 
pd.to_datetime(cy_date)\n", " \n", " # Current year assertion\n", " if current_year is not None:\n", " try:\n", " current_year = int(current_year)\n", " except (TypeError, ValueError):\n", " raise ValueError(\"current_year argument must be numeric or convertible to numeric, like 2022 or '2022'\")\n", " else:\n", " current_year = cy_date.year\n", "\n", " cy_volume = (df.assign(year=df[date].dt.year)\n", " .query('year == @current_year and `{0}` <= @cy_date'.format(date))\n", " # pandas expects 'mean', not 'average', as an aggregation name\n", " .agg({measure: \"mean\" if summarization == \"average\" else summarization})\n", " .squeeze()\n", " )\n", " \n", " cy_volume = cy_volume.round(2)\n", " \n", " return cy_volume\n", "\n", "def pytd_volume(\n", " df, \n", " measure = None, \n", " date = None, \n", " summarization = \"sum\", \n", " current_year = None, \n", " py_date = None):\n", " \"\"\"\n", " Calculate the previous year-to-date (PYTD) volume of a given measure in a pandas DataFrame.\n", " \"\"\"\n", " # Table must be a pandas DataFrame and have at least one row\n", " if not isinstance(df, pd.DataFrame):\n", " raise ValueError(\"df must be a pandas DataFrame\")\n", " if df.shape[0] == 0:\n", " raise ValueError(\"df must have at least one row, execution is stopped\")\n", " \n", " # Summarization Assertion\n", " if summarization not in [\"sum\", \"count\", \"average\"]:\n", " raise ValueError(\"summarization must be one of: 'sum', 'count' or 'average'.\")\n", " \n", " # Measure, Date and Dimensions Assertion\n", " if measure is not None:\n", " if measure not in df.columns:\n", " raise ValueError(\"measure must be a column in the dataset\")\n", " else:\n", " # If measure isn't supplied, use the first numerical column\n", " measure = df.select_dtypes(include=[np.number]).columns[0]\n", " \n", " # Get Date\n", " if date is not None:\n", " if date not in df.columns:\n", " raise ValueError(\"date must be a column in the dataset\")\n", " \n", " df[date] = pd.to_datetime(df[date])\n", " \n", " if not pd.api.types.is_datetime64_any_dtype(df[date]):\n", " raise ValueError(\"'date' must be a date column in the dataset\")\n", " else:\n", " # Getting the first date field available\n", " date = df.select_dtypes(include=[np.datetime64]).columns[0] if not df.select_dtypes(include=[np.datetime64]).columns.empty else None\n", " \n", " if date is None:\n", " raise ValueError(\"No date column found in the dataset\")\n", "\n", " # Prior Year's Date\n", " if py_date is None:\n", " py_date = get_py_date(df)\n", " else:\n", " py_date = pd.to_datetime(py_date)\n", " \n", " # Previous year assertion\n", " if current_year is not None:\n", " try:\n", " previous_year = int(current_year) - 1\n", " except (TypeError, ValueError):\n", " raise ValueError(\"current_year argument must be numeric or convertible to numeric, like 2022 or '2022'\")\n", " else:\n", " previous_year = py_date.year\n", "\n", " py_volume = (df.assign(year=df[date].dt.year)\n", " .query('year == @previous_year and `{0}` <= @py_date'.format(date))\n", " .agg({measure: \"mean\" if summarization == \"average\" else summarization})\n", " .squeeze()\n", " )\n", " \n", " py_volume = py_volume.round(2)\n", "\n", " return py_volume" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Description: This script contains helper functions for identifying outliers in the data\n", "\n", "def get_trend_outliers(\n", " df, \n", " dimension, \n", " measure, \n", " total = None, \n", " summarization = \"sum\", \n", " coverage = 0.5, \n", " coverage_limit = 5):\n", " \"\"\"\n", " Returns trend outliers based on a given dataframe, dimension, and measure.\n", " 
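For 'sum' and 'count', levels are ranked by their share of total absolute year-over-year (YTD vs. prior-YTD) change and kept while the lagged cumulative share is below 'coverage', up to 'coverage_limit' levels.\n", " For example, with coverage=0.5 and changes of +300, -200 and +100 across three levels (shares 0.50, 0.33 and 0.17), only the first level is returned.\n", " 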
\"\"\"\n", " grouped = df.groupby(dimension)\n", "\n", " # Define a function to apply to each group\n", " def process_group(group, ytd_volume, pytd_volume, summarization, measure):\n", " curr_volume = ytd_volume(group, summarization=summarization, measure=measure)\n", " prev_volume = pytd_volume(group, summarization=summarization, measure=measure)\n", " change = curr_volume - prev_volume\n", " change_p = f\"{round(change / prev_volume * 100, 2)}%\"\n", " abs_change = abs(change)\n", " trend = \"increase\" if change > 0 else \"decrease\"\n", " \n", " output = pd.Series({\n", " \"curr_volume\": curr_volume,\n", " \"prev_volume\": prev_volume,\n", " \"change\": change,\n", " \"change_p\": change_p,\n", " \"abs_change\": abs_change,\n", " \"trend\": trend,\n", " })\n", " \n", " return output\n", "\n", " # Apply the function to each group and create a new DataFrame\n", " table = grouped.apply(process_group, ytd_volume, pytd_volume, summarization, measure)\n", "\n", " # Reset the index and sort by abs_change\n", " table = table.reset_index().sort_values(by=\"abs_change\", ascending=False)\n", "\n", " if summarization in [\"sum\", \"count\"]:\n", " if total is None:\n", " total = df[measure].sum().round(2)\n", "\n", " table = (\n", " table.assign(share=lambda x: x['abs_change'] / x['abs_change'].sum())\n", " .assign(cum_share=lambda x: x['share'].cumsum())\n", " .assign(lag_cum_share=lambda x: x['cum_share'].shift(fill_value=False))\n", " ).reset_index()\n", "\n", " table = table[table['lag_cum_share'] < coverage]\n", " table = table.head(coverage_limit)\n", "\n", " if df.shape[0] == 1 and table['cum_share'].iloc[0] == 1:\n", " return None\n", "\n", " elif summarization == 'average':\n", " if total is None:\n", " total = table[measure].mean().round(2)\n", "\n", " table = (table\n", " .assign(share = lambda x: x[measure]/total - 1)\n", " .assign(abs_share=lambda x: x['share'].abs())\n", " .sort_values(by='abs_share', ascending=False)\n", " .assign(cum_share=lambda x: x['share'].abs().cumsum()/(x['share'].max() - x['share'].min()))\n", " .loc[lambda x: (x['cum_share'] >= coverage*2).shift(fill_value=False).cumsum() == 0]\n", " .iloc[:coverage_limit]\n", " )\n", "\n", " n_outliers = table.shape[0]\n", " outlier_levels = table[dimension].astype(str).values.tolist()\n", " outlier_values = table[measure].round(1).values.tolist()\n", " outlier_values_p = (table[\"share\"].round(2) * 100).astype(str).add(\"%\").values.tolist()\n", "\n", " output = {\n", " \"n_outliers\": n_outliers,\n", " \"outlier_levels\": outlier_levels,\n", " \"outlier_values\": outlier_values,\n", " \"outlier_values_p\": outlier_values_p\n", " }\n", "\n", " return output\n", "\n", "\n", "# Monthly Trend Analysis\n", "\n", "def monthly_trend(df, measure, dimensions, date_column='Date', summarization='sum', top_n=3):\n", " # Ensure the date column is in datetime format\n", " df[date_column] = pd.to_datetime(df[date_column])\n", " df['Year'] = df[date_column].dt.year\n", " df['Month'] = df[date_column].dt.month\n", "\n", " # Group by dimensions, Year, and Month\n", " group_by_columns = dimensions + ['Year', 'Month']\n", " if summarization == 'sum':\n", " summary_func = np.sum\n", " elif summarization == 'average':\n", " summary_func = np.mean\n", " elif summarization == 'count':\n", " summary_func = 'count'\n", " else:\n", " raise ValueError(\"Invalid summarization method. 
Choose from 'sum', 'average', 'count'.\")\n", "\n", " summary = df.groupby(group_by_columns)[measure].agg(summary_func).reset_index(name='Measure')\n", " summary.sort_values(by=group_by_columns, inplace=True)\n", " summary['Change'] = summary.groupby(dimensions)['Measure'].diff().fillna(0)\n", " summary['Percentage_Change'] = summary.groupby(dimensions)['Measure'].pct_change().fillna(0) * 100\n", "\n", " narrative = {}\n", " for dimension in dimensions:\n", " dimension_summary = summary.groupby([dimension, 'Year', 'Month']).agg({'Change': 'sum', 'Percentage_Change': 'mean'}).reset_index()\n", " # Top N positive changes\n", " top_increases = dimension_summary.nlargest(top_n, 'Change')\n", " # Top N negative changes\n", " top_decreases = dimension_summary.nsmallest(top_n, 'Change')\n", "\n", " narrative[dimension + ' Increases'] = '; '.join([f\"{row[dimension]}, {row['Year']}-{row['Month']:02d}: Change={row['Change']:.2f}, ({row['Percentage_Change']:.2f}%)\" for index, row in top_increases.iterrows()])\n", " narrative[dimension + ' Decreases'] = '; '.join([f\"{row[dimension]}, {row['Year']}-{row['Month']:02d}: Change={row['Change']:.2f}, ({row['Percentage_Change']:.2f}%)\" for index, row in top_decreases.iterrows()])\n", "\n", " # Overall summary\n", " total_increase_months = summary[summary['Change'] > 0].shape[0]\n", " total_decrease_months = summary[summary['Change'] < 0].shape[0]\n", " narrative['Overall Summary'] = f\"Total months with increase: {total_increase_months}, Total months with decrease: {total_decrease_months}\"\n", "\n", " return narrative" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Configure the OpenAI API client with the base URL for Ollama\n", "from openai import OpenAI\n", "openai_api_key = 'ollama'\n", "\n", "# Instantiate the OpenAI client with the base_url for Ollama\n", "client = OpenAI(\n", " api_key=openai_api_key, \n", " base_url='http://localhost:11434/v1' # Ollama's base URL\n", ")\n", "\n", "def gpt_get_completions(\n", " prompt,\n", " openai_api_key=openai_api_key \n", "):\n", "\n", " # Set up the OpenAI API client with the messages and model information\n", " chat_completion = client.chat.completions.create(\n", " messages=[\n", " {\"role\": \"system\", \"content\": \"You are an expert business intelligence data analyst.\"},\n", " {\"role\": \"user\", \"content\": prompt}\n", " ],\n", " model=\"llama3:latest\", # Replace with your desired model - llama2:latest is the default\n", " )\n", "\n", " # Return the content of the first message in the completions\n", " return chat_completion.choices[0].message.content\n", "\n", "def enhance_narrative(\n", " narrative,\n", " openai_api_key=openai_api_key\n", "):\n", " prompt = f'Provide a detailed overview of the information provided to executive leadership and make sure to highlight any outliers or anomalies in the data: \"{narrative}\"'\n", " output = gpt_get_completions(prompt, openai_api_key=openai_api_key)\n", " return(output)\n", "\n", "def translate_narrative(\n", " narrative,\n", " language,\n", " openai_api_key=openai_api_key\n", "):\n", " prompt = f'Using professional language translate the following text to {language}: \"{narrative}\"'\n", " output = gpt_get_completions(prompt, openai_api_key=openai_api_key)\n", " return(output)\n", "\n", "def summarize_narrative(\n", " narrative,\n", " openai_api_key=openai_api_key\n", "):\n", " prompt = f'Summarize the following narrative to make it shorter: \"{narrative}\"'\n", " output = gpt_get_completions(prompt, 
openai_api_key=openai_api_key)\n", " return(output)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Read the dataset into a pandas DataFrame\n", "filename = 'sales.csv'\n", "\n", "def read_data():\n", " return pd.read_csv(filename, encoding='latin-1', keep_default_na=False)\n", "\n", "data = read_data()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order IDDateRegionProductStorePromotionPriceQuantitySales
0QR102761/30/2019EuropeToolsBig189.6282007.04
1QR152457/26/2020AsiaToolsBig0268.571879.50
2QR139383/5/2020EuropeHomeBig0223.3184019.40
3QR1593410/8/2020South AmericaClothingBig0210.4275680.80
4QR109634/15/2019EuropeToolsBig0193.1224248.20
\n", "
" ], "text/plain": [ " Order ID Date Region Product Store Promotion Price \\\n", "0 QR10276 1/30/2019 Europe Tools Big 1 89.6 \n", "1 QR15245 7/26/2020 Asia Tools Big 0 268.5 \n", "2 QR13938 3/5/2020 Europe Home Big 0 223.3 \n", "3 QR15934 10/8/2020 South America Clothing Big 0 210.4 \n", "4 QR10963 4/15/2019 Europe Tools Big 0 193.1 \n", "\n", " Quantity Sales \n", "0 28 2007.04 \n", "1 7 1879.50 \n", "2 18 4019.40 \n", "3 27 5680.80 \n", "4 22 4248.20 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set up narratives\n", "sales_narrative = narrate_descriptive(data, measure='Sales', dimensions=['Region', 'Product'], return_data=False, coverage=.5)\n", "price_narrative = narrate_descriptive(data, measure='Price', dimensions=['Region', 'Product'], return_data=False, coverage=.5)\n", "trend_narrative = monthly_trend(data, measure='Sales', dimensions=['Region', 'Product'], date_column='Date')\n", "\n", "# Enhance the narratives using locally hosted Ollama LLM model\n", "sales_results = enhance_narrative(sales_narrative)\n", "price_results = enhance_narrative(price_narrative)\n", "trend_results = enhance_narrative(trend_narrative)\n", "\n", "# Print the enhanced narratives\n", "print(\"Sales Narrative:\")\n", "print(sales_results)\n", "print(\"\\nPrice Narrative:\")\n", "print(price_results)\n", "print(\"\\nTrend Narrative:\")\n", "print(trend_results)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Llama 3 Results\n", "\n", "Sales Narrative:\n", "To executive leadership, I am pleased to present the key findings from our recent analysis of sales data across all regions and products.\n", "\n", "**Total Sales**: Our analysis reveals that total sales across all regions reached an impressive $38,790,478.42. This is a significant milestone for the company, indicating strong demand for our products globally.\n", "\n", "**Regional Outliers**: Upon examining regional sales data, we have identified two outliers: North America and Europe. These regions account for a significant proportion of overall sales, with North America contributing 47% ($18,073,636.4) and Europe contributing 35% ($13,555,127.7). This suggests that our products are highly popular in these regions, which is an important insight for our market strategy.\n", "\n", "**Product Outliers**: Our analysis also reveals two outliers among product categories: Food & Beverage (40%, $15,543,697) and Electronics (22%, $8,609,628.8). These high-performing products could be leveraged to drive growth and expansion into new markets. The dominance of Food & Beverage and Electronics in terms of sales suggests that these products are highly appealing to customers.\n", "\n", "**Key Insights**: Based on our analysis, I would like to highlight the following key insights for executive leadership:\n", "\n", "1. **Regional Focus**: Our data indicates a significant focus on North America and Europe. This could inform future market expansion strategies and resource allocation.\n", "2. **Product Portfolio**: The performance of Food & Beverage and Electronics products suggests that these categories should be prioritized in terms of product development, marketing, and sales efforts.\n", "3. 
**Cross-Regional Analysis**: A cross-regional analysis could provide valuable insights into customer preferences, purchasing habits, and market trends, helping us refine our marketing strategies.\n", "\n", "In conclusion, the data presents a compelling story about the strengths of our business. By focusing on high-performing regions and products, we can optimize our growth strategy, drive innovation, and increase revenue.\n", "\n", "Price Narrative:\n", "**Executive Summary:**\n", "\n", "Our analysis of the company's sales data reveals several key insights that can inform business decisions and drive growth.\n", "\n", "**Total Sales:** The total price across all regions is $17,310.10, indicating a strong overall performance.\n", "\n", "**Regional Outliers:** Two regions stand out as significant outliers in terms of their total sales: North America ($7,390.31, 43.0%) and Europe ($5,569.86, 32.0%). These regions account for nearly three-quarters of the company's total sales, suggesting a strong presence in these markets.\n", "\n", "**Product Outliers:** Similarly, two products are notable outliers in terms of their sales: Food & Beverage ($6,099.01.5, 35.0%) and Electronics ($34,428.6, 20.0%). These products contribute significantly to the company's overall revenue, indicating a strong focus on these categories.\n", "\n", "**Key Takeaways:**\n", "\n", "1. **Regional Focus:** Our analysis suggests that the company should maintain its strong presence in North America and Europe, as these regions are driving a significant portion of sales.\n", "2. **Product Strategy:** The data highlights the importance of Food & Beverage and Electronics products, which may warrant further investment in marketing, product development, or customer outreach to capitalize on their growth potential.\n", "3. **Growth Opportunities:** The outliers identified in our analysis suggest that there may be untapped opportunities for growth in these regions and product categories.\n", "\n", "**Recommendations:**\n", "\n", "1. Conduct a deeper analysis of the North American and European markets to identify specific trends and opportunities for growth.\n", "2. Develop targeted marketing campaigns to capitalize on the demand for Food & Beverage and Electronics products.\n", "3. Consider expanding the company's product offerings or services to cater to the growing demand in these categories.\n", "\n", "By leveraging these insights, executive leadership can inform data-driven decisions that drive business growth and profitability.\n", "\n", "Trend Narrative:\n", "Executive Leadership Overview:\n", "\n", "The provided data highlights significant changes in regional and product sales trends over the past two years. 
The key findings are as follows:\n", "\n", "**Regional Increases:**\n", "\n", "* Europe saw a notable increase of €330,688.60 (93.18%) during December 2019.\n", "* North America experienced substantial growth with increases of $32,9413.50 (214.14%) in November 2019 and $30,3998.84 (160.89%) in September 2019.\n", "\n", "**Regional Decreases:**\n", "\n", "* Europe suffered a decline of €50,1492.88 (-66.75%) during January 2020.\n", "* North America experienced decreases of $46,9053.12 (-53.10%) and $44,3644.18 (-53.63%) in January 2021 and 2020, respectively.\n", "\n", "**Product Increases:**\n", "\n", "* Food & Beverage saw significant growth with increases of $249,176.08 (88.60%) during October 2020.\n", "* Electronics experienced a notable increase of $220,380.30 (258.69%) in December 2019.\n", "* Food & Beverage also had an increase of $215,149.88 (68.62%) in September 2019.\n", "\n", "**Product Decreases:**\n", "\n", "* Food & Beverage suffered decreases of $479,361.32 (-65.18%) and $387,654.16 (-56.22%) during January 2020 and 2021, respectively.\n", "* Electronics experienced a decline of $26,2982.74 (-61.64%) in January 2020.\n", "\n", "**Overall Summary:**\n", "\n", "* Total months with increase: 433\n", "* Total months with decrease: 401\n", "\n", "Key Observations:\n", "\n", "1. **Europe's rollercoaster ride:** Europe had significant growth in December 2019 but suffered a decline in January 2020.\n", "2. **North America's consistent growth:** North America experienced substantial growth throughout 2019, but then saw declines in January 2020 and 2021.\n", "3. **Food & Beverage's mixed performance:** Food & Beverage showed strong growth in October 2020, but also suffered significant declines in January 2020 and 2021. Electronics, on the other hand, experienced a notable increase in December 2019, followed by a decline in January 2020.\n", "\n", "Outliers or anomalies in the data:\n", "\n", "1. **North America's exceptional growth in November 2019:** The region saw a staggering 214.14% increase, which may be an anomaly compared to the other months.\n", "2. **Europe's significant decline in January 2020:** This decline of €50,1492.88 (-66.75%) is unusual compared to the overall trend.\n", "\n", "Recommendations:\n", "\n", "1. **Conduct further analysis on North America's November 2019 growth:** Investigate the underlying causes for this exceptional growth and consider whether it was an isolated event or part of a broader trend.\n", "2. **Analyze Europe's January 2020 decline:** Identify the factors contributing to this significant decline and consider potential corrective measures.\n", "3. **Monitor Food & Beverage's performance:** Keep a close eye on the product's trends, as its mixed performance may indicate underlying issues that need attention.\n", "\n", "By highlighting these key findings, outliers, and recommendations, we can inform executive leadership of the data's implications and guide business decisions to optimize regional and product performance." ] } ], "metadata": { "kernelspec": { "display_name": "ollama", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }