{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to the MIT Supercloud Dataset\n", "\n", "This notebook introduces the MIT Supercloud Dataset: the types of data collected and ways to load, process, and plot them.\n", "\n", "Details of the dataset can be found in [The MIT Supercloud Dataset](https://arxiv.org/abs/2108.02037)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Functions" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def plot_time_series(df=None,\n", "                     columns=None,\n", "                     downsample=1,\n", "                     samples_per_second=1,\n", "                     title=None):\n", "    \"\"\"\n", "    Plot CPU or GPU time series data.\n", "\n", "    Inputs:\n", "\n", "    df: time series pandas dataframe\n", "    columns: columns of the time series to plot (at most 9, one per subplot)\n", "    downsample: plot every downsample-th sample\n", "    samples_per_second: number of samples collected per second\n", "    title: string for plot title\n", "    \"\"\"\n", "\n", "    # time index. CPU time series are sampled every 10 seconds, GPU every tenth of a second\n", "    t = (np.arange(df.shape[0]) / samples_per_second)[::downsample]\n", "\n", "    # colors\n", "    cm = plt.get_cmap('tab10')\n", "    num_colors = df.columns.shape[0]\n", "    colors = [cm(1.*i/num_colors) for i in range(num_colors)]\n", "\n", "    # figure\n", "    fig, axs = plt.subplots(3,3,figsize=(16,16))\n", "    plt.suptitle(title,fontsize=14)\n", "\n", "    # loop over columns to plot, one subplot per column\n", "    for ax,column,color in zip(axs.ravel(),columns,colors):\n", "        plot_data = df[column].values[::downsample]\n", "        ax.plot(t,plot_data,color=color)\n", "        ax.tick_params(axis='x',rotation=-45)\n", "        ax.set_xlabel('Time (s)')\n", "        ax.set_ylabel(column)\n", "        ax.grid()\n", "    plt.show()\n", "    plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Paths" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# This path points to the root directory where the data was extracted\n", "ROOT_PATH = 'PATH/TO/DATASET/LOCATION'\n", "\n", "# The paths below point to specific files or directories\n", "SCHEDULER_LOG_PATH = os.path.join(ROOT_PATH,'scheduler-log.csv') # slurm log csv\n", "NODE_DATA_PATH = os.path.join(ROOT_PATH,'node-data.csv') # node data csv\n", "CPU_DATA_PATH = os.path.join(ROOT_PATH,'cpu') # cpu time series directory\n", "GPU_DATA_PATH = os.path.join(ROOT_PATH,'gpu') # gpu time series directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Slurm Log" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# slurm log dataframe\n", "scheduler_log_df = pd.read_csv(SCHEDULER_LOG_PATH)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns for Scheduler log dataframe:\n", "\n", "id_job\n", "id_array_job\n", "id_array_task\n", "id_user\n", "kill_requid\n", "nodes_alloc\n", "nodelist\n", "cpus_req\n", 
"derived_ec\n", "exit_code\n", "gres_used\n", "array_max_tasks\n", "array_task_pending\n", "constraints\n", "flags\n", "mem_req\n", "partition\n", "priority\n", "state\n", "timelimit\n", "time_submit\n", "time_eligible\n", "time_start\n", "time_end\n", "time_suspended\n", "track_steps\n", "tres_alloc\n", "tres_req\n", "job_type\n" ] } ], "source": [ "# columns in slurm log dataframe\n", "print('Columns for Scheduler log dataframe:\\n')\n", "print(\"\\n\".join([str(i) for i in scheduler_log_df.columns]))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 395914 jobs in the scheduler log, of which 98176 requested GPUs.\n" ] } ], "source": [ "# job IDs in the slurm log\n", "scheduler_log_job_ids = scheduler_log_df.id_job.unique()\n", "\n", "# indices of gpu jobs\n", "gpu_idx = scheduler_log_df.tres_req.apply(lambda x:str(x).find('1001')>0 or str(x).find('1002')>0)\n", "scheduler_log_job_ids_gpu = np.unique(scheduler_log_df[gpu_idx].id_job.values)\n", "\n", "print('There are {} jobs in the scheduler log, of which {} requested GPUs.'.format(scheduler_log_job_ids.shape[0],\n", " scheduler_log_job_ids_gpu.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Node Data\n", "Explore the data colleced from each compute node on the system." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# node data dataframe\n", "node_data_df = pd.read_csv(NODE_DATA_PATH)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns for Node data dataframe:\n", "\n", "Node\n", "Time\n", "UserPIDCount\n", "FSlatency\n", "LoadAvg\n", "MemoryFreeInactiveKB\n", "LustreRPCTotals\n" ] } ], "source": [ "# columns in node data dataframe\n", "print('Columns for Node data dataframe:\\n')\n", "print(\"\\n\".join([str(i) for i in node_data_df.columns]))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr style=\"text-align: right;\"><th></th><th>Node</th><th>Time</th><th>UserPIDCount</th><th>FSlatency</th><th>LoadAvg</th><th>MemoryFreeInactiveKB</th><th>LustreRPCTotals</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>r7217787-n911952</td><td>1.614557e+09</td><td>91472915699408:11|1706828023724:15|65855960046...</td><td>0.0</td><td>29.24</td><td>371.0</td><td>26359.0</td></tr>\n",
"<tr><th>1</th><td>r4858666-n911952</td><td>1.614557e+09</td><td>66720169194922:40|</td><td>0.0</td><td>4.07</td><td>363.0</td><td>228.0</td></tr>\n",
"<tr><th>2</th><td>r2582019-n911952</td><td>1.614557e+09</td><td>22654259079669:47|</td><td>0.0</td><td>40.01</td><td>377.0</td><td>2006.0</td></tr>\n",
"<tr><th>3</th><td>r9040233-n911952</td><td>1.614557e+09</td><td>91472915699408:7|12886809117418:29|53679664603...</td><td>0.0</td><td>29.58</td><td>369.0</td><td>8289.0</td></tr>\n",
"<tr><th>4</th><td>r4229531-n911952</td><td>1.614557e+09</td><td>15914930715133:5|</td><td>0.0</td><td>0.26</td><td>390.0</td><td>227.0</td></tr>\n",
"</tbody>\n", "</table>\n", "