# SysFlow support for Kubernetes environments

Starting with version 0.5.0, SysFlow records contain more information related to containers in case they are part of a Kubernetes (k8s) or OpenShift environment. Specifically, there is a new record type, `KE`, that captures and exposes the Kubernetes events in the new `k8s.*` attributes. Furthermore, all other record types are extended with Kubernetes pod information in the new `pod.*` attributes.

These new attributes are, in more detail:

<pre>
k8s.action                     K8s Event Action
k8s.kind                       K8s Event Component Type
k8s.msg                        K8s Event Message

pod.id                         Pod Identifier
pod.name                       Pod Name
pod.nname                      Pod Node Name
pod.hostip                     Pod Host IP
pod.internalip                 Pod Internal IP
pod.ns                         Pod Namespace
pod.rstrtcnt                   Pod Restart Count
pod.services                   Pod Services

</pre>

In this notebook we will look into the new information using data from a test setup running Instana's [robot-shop application](https://github.com/instana/robot-shop), with a special eye to the new information that is available with respect to cluster-relevant IP addresses.

First, we describe the experimental setup that was used to create our test data. Next, after loading the experimental SysFlow data, we look at the new k8s event data, the new information related to pods as avaible per SysFlow record, and compare the newly available cluster-level IP address that augment the observed network activity in the regular NF objects.

# Experimental setup and timeline

## Setup

The experimental setup is based on the installation of: 
- minikube v1.25.2 on Ubuntu 18.04 (kvm/amd64) with
- kubernetes version v1.23.3 using the
- virtualbox driver
as a small base Kubernetes test environment.

The experiment consists in the installation of Instana's [robot-shop application](https://github.com/instana/robot-shop). To have sufficient resources to run this multi-container application, slightly more than the default minimal configuration should be used, e.g., here we are using 4 CPUs and 16GB of memory for a virtualbox VM.

```
$ minikube start
* minikube v1.25.2 on Ubuntu 18.04 (kvm/amd64)
* Using the virtualbox driver based on user configuration
* Starting control plane node minikube in cluster minikube
* Creating virtualbox VM (CPUs=4, Memory=16000MB, Disk=100000MB) ...
* Preparing Kubernetes v1.23.3 on Docker 20.10.12 ...
  - kubelet.housekeeping-interval=5m
  - Generating certificates and keys ...
  - Booting up control plane ...
  - Configuring RBAC rules ...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
* Verifying Kubernetes components...
* Enabled addons: storage-provisioner, default-storageclass
* Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
```

The next section contains information about the experiment used to collect the SysFlow data with `sf-collector` 0.5.0.

## Timeline of the experiment

The experiment used to create our data consists of:
- the creation of the namespace `robot-shop`
- the installation of the the `robot-shop` with helm charts provided by the application
- the deletion of some of its containers (`mongodb`, `web`, `user`) - which will be automatically recreated by the application
- deinstall and cleanup of the `robot-shop` application.

The scripted experiment logs the timestamps of these events so that we have a baseline of the events to compare with the collected SysFlow data.

In [1]:
from sysflow.reader import FlattenedSFReader, SFReader
from sysflow.formatter import SFFormatter
import json
import os
import pprint
import pickle
import gzip
import pandas as pd
import numpy as np
import datetime
import tabulate
import textwrap
import plotly.graph_objects as go
import plotly as pl
import plotly.io as pio
pio.renderers.default = 'iframe'
pd.set_option('display.max_rows', 50)

In [2]:
data_dir = 'data/'

In [3]:
log_file = data_dir + 'experiment.log'
log_content_selection = ['----- ', 'waiting']

In [4]:
log_selected_lines = []
with open(log_file, 'r') as inp:
    for line in inp: 
        if any([p in line for p in log_content_selection]):
            log_selected_lines.append(line.rstrip())
log_events = []
for line in log_selected_lines:
    time_str = ' '.join(line.split()[0:2])
    tdt = datetime.datetime.strptime(time_str, '%Y-%m-%d %H:%M:%S,%f')  #.replace(tzinfo=localtz)    
    rest = ' '.join(line.split()[4:])
    event = rest.replace('----- ', '').replace('... ', '').replace(' seconds', 's')
    log_events.append([tdt, event])
print(tabulate.tabulate(log_events))

--------------------------  -------------------------
2022-03-17 18:48:40.736000  starting experiment
2022-03-17 18:48:40.736000  create project robot-shop
2022-03-17 18:48:40.991000  waiting for 60s
2022-03-17 18:49:41.047000  install robot shop
2022-03-17 18:49:42.531000  waiting for 900s
2022-03-17 19:04:47.423000  kill container mongodb
2022-03-17 19:04:53.940000  waiting for 300s
2022-03-17 19:09:53.947000  kill container web
2022-03-17 19:10:02.620000  waiting for 300s
2022-03-17 19:15:02.660000  kill container user
2022-03-17 19:15:36.016000  waiting for 300s
2022-03-17 19:20:36.117000  delete robot shop
2022-03-17 19:20:37.534000  waiting for 300s
2022-03-17 19:25:37.627000  delete project robot-shop
2022-03-17 19:25:50.305000  waiting for 300s
2022-03-17 19:30:50.346000  experiment ends
--------------------------  -------------------------


# Experimental SysFlow data

The collected SysFlow data is combined into the accompanying `experiment.sf` (as SysFlow trace files are essentially AVRO files, AVRO tools like `avro-tools concat` can be used to combine multiple SysFlow traces into one file).
This file is read into a Pandas DataFrame for our further evaluation.

In [5]:
sf_file = data_dir + 'experiment.sf'

In [6]:
# reading of the SysFlow trace file and conversion to a Pandas DataFrame (with caching onto disk)
df_file = data_dir + 'experiment_df.pkl.gz'
if os.path.exists(df_file):
    with gzip.open(df_file, 'rb') as inp:
        df = pickle.load(inp)
else:    
    reader = FlattenedSFReader(sf_file, False)
    formatter = SFFormatter(reader)
    df = formatter.toDataframe()
    # applying some functions to allow for hashing of the more complex data types
    df['pod.internalip'] = df['pod.internalip'].apply(tuple)
    df['pod.hostip'] = df['pod.hostip'].apply(tuple)
    df['pod.services_str'] = df['pod.services'].apply(str)
    with gzip.open(df_file, 'wb') as out:
        pickle.dump(df, out)

In [7]:
print(f'The captured data contains {df.shape[0]} SysFlow records, describing the activity of {len(df["container.id"].unique())} containers.')

The captured data contains 169324 SysFlow records, describing the activity of 41 containers.


# Kubernetes Events: the new `KE` record type

Let us first look at the new `KE` record type. For this, we subselect the entries of interest into a new DataFrame `df_ke`.

In [8]:
# k8s.msg fields have still some spurious line-ending
def fix_k8s_msg(msg):
    if msg.endswith('\n\u0000'):
        msg = msg[:-2]
    return msg

In [9]:
# select the KE records, drop all irrelevant columns (empty string or NaN)
df_ke = df[df.type == 'KE'].replace('', np.nan).dropna(axis=1, how='all').reset_index()
# fix k8s.msg
df_ke['k8s.msg'] = df_ke['k8s.msg'].apply(fix_k8s_msg)

The relevant information gathered from the events is shown in the fields:
- `k8s.kind`:   the kind of the K8s infrastructure that this event is concerned with (like "K8S_NODES", "K8S_NAMESPACES", "K8S_PODS", "K8S_REPLICATIONCONTROLLERS", "K8S_SERVICES", "K8S_EVENTS", "K8S_REPLICASETS", "K8S_DAEMONSETS", "K8S_DEPLOYMENTS", "K8S_UNKNOWN")
- `k8s.action`: the action type (like "K8S_COMPONENT_ADDED", "K8S_COMPONENT_MODIFIED", "K8S_COMPONENT_DELETED", "K8S_COMPONENT_ERROR", "K8S_COMPONENT_NONEXISTENT", "K8S_COMPONENT_UNKNOWN")
- `k8s.msg`:    the JSON string of the K8s event


In [10]:
df_ke

Unnamed: 0,index,version,type,ts,ts_uts,pod.hostip,pod.internalip,node.id,node.ip,filename,schema,tags,k8s.action,k8s.kind,k8s.msg
0,0,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,4,(),K8S_COMPONENT_ADDED,K8S_NODES,"{""apiVersion"":""v1"",""items"":[{""addresses"":[""192..."
1,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,4,(),K8S_COMPONENT_ADDED,K8S_NAMESPACES,"{""apiVersion"":""v1"",""items"":[{""labels"":{""kubern..."
2,2,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,4,(),K8S_COMPONENT_ADDED,K8S_PODS,"{""apiVersion"":""v1"",""items"":[{""containerStatuse..."
3,3,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,4,(),K8S_COMPONENT_ADDED,K8S_REPLICATIONCONTROLLERS,"{""apiVersion"":""v1"",""items"":[],""kind"":""Replicat..."
4,4,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,4,(),K8S_COMPONENT_ADDED,K8S_SERVICES,"{""apiVersion"":""v1"",""items"":[{""clusterIP"":""10.9..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,157707,4,KE,2022-03-17T19:25:50.692444,1647545150692443689,(),(),minikube,192.168.59.100,/mnt/data/1647545115,4,(),K8S_COMPONENT_MODIFIED,K8S_NAMESPACES,"{""apiVersion"":""v1"",""items"":[{""labels"":{""kubern..."
169,157708,4,KE,2022-03-17T19:25:50.692444,1647545150692443689,(),(),minikube,192.168.59.100,/mnt/data/1647545115,4,(),K8S_COMPONENT_DELETED,K8S_NAMESPACES,"{""apiVersion"":""v1"",""items"":[{""labels"":{""kubern..."
170,159405,4,KE,2022-03-17T19:26:56.746972,1647545216746972249,(),(),minikube,192.168.59.100,/mnt/data/1647545175,4,(),K8S_COMPONENT_MODIFIED,K8S_NODES,"{""apiVersion"":""v1"",""items"":[{""addresses"":[""192..."
171,164836,4,KE,2022-03-17T19:30:57.880767,1647545457880767012,(),(),minikube,192.168.59.100,/mnt/data/1647545415,4,(),K8S_COMPONENT_ADDED,K8S_NODES,"{""apiVersion"":""v1"",""items"":[{""addresses"":[""192..."


As to be expected given the experiment, the largest activity can be seen around changes in the Pods:

In [11]:
df_ke.value_counts(['k8s.kind', 'k8s.action'], sort=False)

k8s.kind                    k8s.action            
K8S_NAMESPACES              K8S_COMPONENT_ADDED       13
                            K8S_COMPONENT_DELETED      1
                            K8S_COMPONENT_MODIFIED     3
K8S_NODES                   K8S_COMPONENT_ADDED        3
                            K8S_COMPONENT_MODIFIED     9
K8S_PODS                    K8S_COMPONENT_ADDED       24
                            K8S_COMPONENT_DELETED     15
                            K8S_COMPONENT_MODIFIED    77
K8S_REPLICATIONCONTROLLERS  K8S_COMPONENT_ADDED        1
K8S_SERVICES                K8S_COMPONENT_ADDED       15
                            K8S_COMPONENT_DELETED     12
dtype: int64

## Unpacking `k8s.msg` data

A deeper understanding of what the KE records tell us about the cluster activity can be found when expanding the `k8s.msg` field of the records.

The JSON-formatted k8s.msg contains a list of items to which the event relates. Usually this is only one item, but in some cases, the event is related to multiple items, e.g., when multiple items of the same type are added or deleted.

For this reason, we create a new DataFrame, where events are potentially duplicated for each item if there are multiple. Special consideration is given to extract IP releated data out of `k8s.msg` where avaible. The resulting information is stored in the new DataFrame `df_ke_ext`.

In [12]:
table = []
itemcols = ['name', 'namespace', 'podIP', 'hostIP', 'clusterIP']
for ie,e in df_ke.iterrows():
    msg  = json.loads(e['k8s.msg'])
    for item in msg['items']:
        d = e.to_dict()
        d['msg_hash'] = hash(str(msg))
        d['kind'] = msg['kind']
        d['typ'] = msg['type']
        d['name'] = item.get('name')
        d['namespace'] = item.get('namespace')
        d['ts_item'] = item.get('timestamp')
        if item.get('podIP'):
            d['ip'] = item.get('podIP')
            d['iptype'] = 'podIP'
            # d['podIP'] = item.get('podIP')
            table.append(d)
        elif item.get('hostIP'):
            d['ip'] = item.get('hostIP')
            d['iptype'] = 'hostIP'
            # d['hostIP'] = item.get('hostIP')
            table.append(d)
        elif item.get('clusterIP') and not item.get('clusterIP')=='None':
            if msg['kind'] != 'Service':
                print(f'>>>> WARNING: clusterIP but not a service - investigate! msg: {msg}')
                continue
            d['ip'] = item.get('clusterIP')
            d['iptype'] = 'clusterIP'
            # d['clusterIP'] = item.get('clusterIP')
            ports = item.get('ports')
            for port in ports:
                port['portname'] = port['name'] # make sure to avoid overwriting data in existing record
                del port['name']
                d.update(port)
                # fix naming of proto
                d['proto'] = d['protocol']
                del d['protocol']
                table.append(d)
        else: 
            table.append(d)
df_ke_ext = pd.DataFrame(table).reset_index(drop=True).sort_values(['ts','kind'])
df_ke_ext

Unnamed: 0,index,version,type,ts,ts_uts,pod.hostip,pod.internalip,node.id,node.ip,filename,...,name,namespace,ts_item,ip,iptype,port,targetPort,portname,proto,nodePort
1,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,...,default,,2022-03-17T15:19:35Z,,,,,,,
2,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,...,kube-node-lease,,2022-03-17T15:19:33Z,,,,,,,
3,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,...,kube-public,,2022-03-17T15:19:33Z,,,,,,,
4,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,...,kube-system,,2022-03-17T15:19:33Z,,,,,,,
5,1,4,KE,2022-03-17T18:47:15.845631,1647542835845631000,(),(),minikube,192.168.59.100,/mnt/data/1647542836,...,sysflow,,2022-03-17T15:32:12Z,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,157707,4,KE,2022-03-17T19:25:50.692444,1647545150692443689,(),(),minikube,192.168.59.100,/mnt/data/1647545115,...,robot-shop,,2022-03-17T18:48:40Z,,,,,,,
188,157708,4,KE,2022-03-17T19:25:50.692444,1647545150692443689,(),(),minikube,192.168.59.100,/mnt/data/1647545115,...,robot-shop,,2022-03-17T18:48:40Z,,,,,,,
189,159405,4,KE,2022-03-17T19:26:56.746972,1647545216746972249,(),(),minikube,192.168.59.100,/mnt/data/1647545175,...,minikube,,2022-03-17T15:19:33Z,,,,,,,
190,164836,4,KE,2022-03-17T19:30:57.880767,1647545457880767012,(),(),minikube,192.168.59.100,/mnt/data/1647545415,...,minikube,,2022-03-17T15:19:33Z,,,,,,,


Finally we put the focus onto the subset of data concerned with the robot-shop application. The final DataFrame `df_ke_ext_sel` is a restriction of the KE event information to events related to the robot-shop, with a focus on addition/deletion events.

In [13]:
df_ke_ext_sel = df_ke_ext[((df_ke_ext.name == 'robot-shop') | (df_ke_ext.namespace =='robot-shop')) & ((df_ke_ext.typ == 'ADDED') | (df_ke_ext.typ == 'DELETED'))]
df_ke_ext_sel.value_counts(['kind', 'typ'])

kind       typ    
Pod        ADDED      15
           DELETED    15
Service    ADDED      14
           DELETED    14
Namespace  ADDED       2
           DELETED     1
dtype: int64

## Timeline comparison of experiment with kubernetes events

Next, we compare the timeline of the experiment with the Kubernetes events seen in the SysFlow data.

In [14]:
fig = go.Figure()
pp = pprint.PrettyPrinter(indent=4, width=80, compact=True)
log_events_cleaned = list(filter(lambda x: 'waiting' not in x[1] and 'starting' not in x[1], log_events))
# fig.add_trace(go.Scatter(x=[e[0] for e in log_events_cleaned], y=['LOGEVENT' for e in log_events_cleaned],  text=[e[1] for e in log_events_cleaned], mode='markers', marker_size=10, marker_symbol='diamond-open'))
for e in log_events_cleaned:
    fig.add_annotation(x=e[0], xref='x', yref='paper', y=1., text=e[1], xanchor='left', showarrow=True, textangle=-35, arrowwidth=2)
    fig.add_shape(dict(type="line", x0=e[0], y0=0, x1=e[0], y1=1, xref='x', yref='paper', line=dict(color="RoyalBlue", width=2)))

texts = ['<span style="font-size:x-small">' + 
    df_ke_ext_sel.iloc[i]['k8s.kind'] +'<br>' +
    df_ke_ext_sel.iloc[i]['k8s.action'] +'<br>' +
    pp.pformat(json.loads(df_ke_ext_sel.iloc[i]['k8s.msg'])).replace('\n', ' <br> ') +
    '</span>'
    for i in range(df_ke_ext_sel.shape[0])]
colors = ['green' if df_ke_ext_sel.iloc[i]['k8s.action'] == 'K8S_COMPONENT_ADDED' else 'red'
    for i in range(df_ke_ext_sel.shape[0])]
symbols = ['triangle-up' if df_ke_ext_sel.iloc[i]['k8s.action'] == 'K8S_COMPONENT_ADDED' else 'triangle-down'
    for i in range(df_ke_ext_sel.shape[0])]
   
fig.add_trace(go.Scatter(x=df_ke_ext_sel.ts, 
                         y=[df_ke_ext_sel.iloc[i]['k8s.kind'] for i in range(df_ke_ext_sel.shape[0])],
                         text=texts,
                         mode='markers', marker_size=20, marker_color=colors, marker_symbol=symbols, marker_line_color='black', marker_line_width=1))
    
fig.update_layout(height=900, margin=dict(t=200, pad=4), showlegend=False)
fig.show()

Green triangles represent the creation of a component (`K8S_COMPONENT_ADDED`), whereas red triangles represent the deletion of a component (`K8S_COMPONENT_DELETED`). It is quite clear that we find KE events for all the changes related to the robot-shop application happening in the experiment: creation/deletion of the namespace robot-shop, creation/deletion of services and creation/deletion of pods, on beginning and end, but also seen when we forcibly killed some of the containers of the robot-shop application.

## IP address information in `KE` records

Let's now take a quick look into the IP data gathered from the KE records.

In [15]:
df_ke_ext_sel.dropna(subset=['iptype'])[['kind', 'name', 'namespace', 'iptype', 'ip', 'proto', 'port', 'targetPort', 'portname']].sort_values('ip').drop_duplicates()

Unnamed: 0,kind,name,namespace,iptype,ip,proto,port,targetPort,portname
63,Service,shipping,robot-shop,clusterIP,10.103.83.70,TCP,8080.0,8080.0,http
64,Service,ratings,robot-shop,clusterIP,10.104.84.135,TCP,80.0,80.0,http
58,Service,user,robot-shop,clusterIP,10.105.106.134,TCP,8080.0,8080.0,http
62,Service,mysql,robot-shop,clusterIP,10.107.7.181,TCP,3306.0,3306.0,mysql
147,Service,payment,robot-shop,clusterIP,10.108.220.250,TCP,8080.0,8080.0,http
69,Service,mongodb,robot-shop,clusterIP,10.109.105.252,TCP,27017.0,27017.0,mongo
146,Service,cart,robot-shop,clusterIP,10.109.213.103,TCP,8080.0,8080.0,http
141,Service,rabbitmq,robot-shop,clusterIP,10.109.218.161,TCP,4369.0,4369.0,tcp-epmd
60,Service,redis,robot-shop,clusterIP,10.111.214.104,TCP,6379.0,6379.0,redis
137,Service,catalogue,robot-shop,clusterIP,10.96.58.129,TCP,8080.0,8080.0,http


In the data, we can recognize quite a bit of IP address information:
- in some cases we see a `hostIP` (that corresponds to the `node.ip` data in the records)
- podIPs in the private range 172.17.0.0/16 showing the main internal IPs of the respective pods
- clusterIPs in the private range 10.0.0.0/8: these are most interesting as they represent the IP addresses of the service endpoints, i.e., this is new information of the cluster level. This is more complex information as it not only includes IP address, but the service is specific also to a port and has additional information!

Let us keep track of the pod IPs gleaned from this data for later use in the comparison with the observed network traffic.

In [16]:
podips = {}
for irow, row in df_ke_ext_sel.dropna(subset=['iptype'])[['kind', 'name', 'namespace', 'iptype', 'ip', 'proto', 'port', 'targetPort', 'portname']].sort_values('ip').drop_duplicates().iterrows():
    if not row['iptype'] == 'podIP': continue
    podips.setdefault(row['ip'], set()).add(row['name'])
podips

{'172.17.0.10': {'mysql-6d778f4c8f-4bcr7'},
 '172.17.0.11': {'shipping-7f6dfbf46f-94trr'},
 '172.17.0.12': {'ratings-7ccf67b49f-6qckr'},
 '172.17.0.13': {'dispatch-69b65d89b9-4lgl7'},
 '172.17.0.14': {'payment-5465d9cc79-8ln4b'},
 '172.17.0.15': {'redis-0'},
 '172.17.0.4': {'catalogue-998b69bc9-bfnr7'},
 '172.17.0.5': {'rabbitmq-785b678f74-mhhtg'},
 '172.17.0.6': {'user-899b6c7ff-c7wnj'},
 '172.17.0.8': {'cart-7d7745696b-qgb99'},
 '172.17.0.9': {'mongodb-67c5456f4-d4bgv', 'web-77486f858f-jnf9r'}}

# New `pod.*` fields

SysFlow records identify the containers they belong to. In the context of a Kubernetes/OpenShift cluster, each container belongs to a *pod*, which in turn is part of a *namespace*. Every SysFlow record now contains this metadata that helps to put the low-level, container-related information into the context of the cluster.

## Container vs Pod

Let us first look into the relationship between containers and pods. To make this easier, let's focus again on the containers related to the robot-shop application.

In [17]:
df[df['container.name'].str.contains('robot-shop')].value_counts(['container.name', 'container.id', 'container.image', 'pod.name'], sort=False).reset_index()

Unnamed: 0,container.name,container.id,container.image,pod.name,0
0,k8s_POD_cart-7d7745696b-qgb99_robot-shop_06111...,9f2d50473bb4,k8s.gcr.io/pause:3.6:3.6,,2
1,k8s_POD_catalogue-998b69bc9-bfnr7_robot-shop_4...,eb32dd737f52,k8s.gcr.io/pause:3.6:3.6,,2
2,k8s_POD_dispatch-69b65d89b9-4lgl7_robot-shop_b...,63e3c777b5c7,k8s.gcr.io/pause:3.6:3.6,,2
3,k8s_POD_mongodb-67c5456f4-d4bgv_robot-shop_646...,9d90554ce0a6,k8s.gcr.io/pause:3.6:3.6,,2
4,k8s_POD_mongodb-67c5456f4-ddhnf_robot-shop_423...,a1397d34b86e,k8s.gcr.io/pause:3.6:3.6,,2
5,k8s_POD_mysql-6d778f4c8f-4bcr7_robot-shop_93bb...,cabf4edbc827,k8s.gcr.io/pause:3.6:3.6,,2
6,k8s_POD_payment-5465d9cc79-8ln4b_robot-shop_39...,f4631d398156,k8s.gcr.io/pause:3.6:3.6,,2
7,k8s_POD_rabbitmq-785b678f74-mhhtg_robot-shop_b...,ba13946ea53f,k8s.gcr.io/pause:3.6:3.6,,2
8,k8s_POD_ratings-7ccf67b49f-6qckr_robot-shop_ee...,4d4b1cf9894d,k8s.gcr.io/pause:3.6:3.6,,2
9,k8s_POD_redis-0_robot-shop_29c707fd-318a-4af4-...,b3f1af98e495,k8s.gcr.io/pause:3.6:3.6,,2


To understand this better, let's take a look at one specific container of the robot-shop setup, picking up the data for the `mongodb` container, as this is also one of the containers that gets killed as part of the experiment and subsequently gets restarted by Kubernetes.

In [18]:
df[df['container.name'].str.contains('mongodb')].sort_values('ts_uts').groupby(['container.name', 'container.id', 'pod.name']).agg({'container.image': 'first', 'ts': ['min', 'max']}).reset_index().sort_values(by=[('ts', 'min')])

Unnamed: 0_level_0,container.name,container.id,pod.name,container.image,ts,ts
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,first,min,max
2,k8s_mongodb_mongodb-67c5456f4-d4bgv_robot-shop...,cd4ea3f3e8ae,,sha256:621ddd7848a2327f471de8541d8b020d65a58a1...,2022-03-17T18:50:01.173124,2022-03-17T18:50:14.874146
3,k8s_mongodb_mongodb-67c5456f4-d4bgv_robot-shop...,cd4ea3f3e8ae,mongodb-67c5456f4-d4bgv,sha256:621ddd7848a2327f471de8541d8b020d65a58a1...,2022-03-17T18:50:01.276773,2022-03-17T19:04:48.912214
0,k8s_POD_mongodb-67c5456f4-d4bgv_robot-shop_646...,9d90554ce0a6,,k8s.gcr.io/pause:3.6:3.6,2022-03-17T19:04:49.484837,2022-03-17T19:04:49.485062
4,k8s_mongodb_mongodb-67c5456f4-ddhnf_robot-shop...,4269f06be75a,,sha256:621ddd7848a2327f471de8541d8b020d65a58a1...,2022-03-17T19:04:55.566971,2022-03-17T19:05:13.797194
5,k8s_mongodb_mongodb-67c5456f4-ddhnf_robot-shop...,4269f06be75a,mongodb-67c5456f4-ddhnf,sha256:621ddd7848a2327f471de8541d8b020d65a58a1...,2022-03-17T19:04:55.568221,2022-03-17T19:20:39.186520
1,k8s_POD_mongodb-67c5456f4-ddhnf_robot-shop_423...,a1397d34b86e,,k8s.gcr.io/pause:3.6:3.6,2022-03-17T19:20:40.328251,2022-03-17T19:20:40.328324


From this listing we observe that:
- there are 4 `container.id`s involved, 2 each having the same `container.image` - corresponding to our killing of the first `mongodb` container and its restart
- when the container comes up first, we do not see yet related pod information (no `pod.name` here), only slightly after the creation
- when the container gets killed or stopped, we see for a very short time a container with the name `k8s_POD_mongodb-...` using the `pause` image

As long as we have all information, including the pod data, the relationship between container and pod is unique:

In [19]:
df[df['pod.name'].astype(bool)][['container.name', 'pod.name']].drop_duplicates().sort_values('container.name').reset_index(drop=True)

Unnamed: 0,container.name,pod.name
0,k8s_catalogue_catalogue-998b69bc9-bfnr7_robot-...,catalogue-998b69bc9-bfnr7
1,k8s_coredns_coredns-64897985d-n4jjl_kube-syste...,coredns-64897985d-n4jjl
2,k8s_dispatch_dispatch-69b65d89b9-4lgl7_robot-s...,dispatch-69b65d89b9-4lgl7
3,k8s_etcd_etcd-minikube_kube-system_fc45a20ce68...,etcd-minikube
4,k8s_kube-apiserver_kube-apiserver-minikube_kub...,kube-apiserver-minikube
5,k8s_kube-controller-manager_kube-controller-ma...,kube-controller-manager-minikube
6,k8s_kube-proxy_kube-proxy-9g9kt_kube-system_e4...,kube-proxy-9g9kt
7,k8s_kube-scheduler_kube-scheduler-minikube_kub...,kube-scheduler-minikube
8,k8s_mongodb_mongodb-67c5456f4-d4bgv_robot-shop...,mongodb-67c5456f4-d4bgv
9,k8s_mongodb_mongodb-67c5456f4-ddhnf_robot-shop...,mongodb-67c5456f4-ddhnf


## IP information from k8s metadata in `pod.*` fields

### IPs of the Pods aka `pod.internalip`

Let us start by looking specifically at the `pod.internalip` field first, limiting ourselves to SysFlow records for the `robot-shop` namespace:

In [20]:
df_rs = df[df['pod.ns'] == 'robot-shop']
df_rs.sort_values('ts_uts').groupby(['pod.name', 'pod.internalip']).agg({'ts': ['min', 'max']}).reset_index().sort_values('pod.name')

Unnamed: 0_level_0,pod.name,pod.internalip,ts,ts
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,min,max
0,catalogue-998b69bc9-bfnr7,"(172.17.0.4,)",2022-03-17T18:49:59.867636,2022-03-17T19:21:08.694894
1,dispatch-69b65d89b9-4lgl7,"(172.17.0.13,)",2022-03-17T18:50:02.550097,2022-03-17T19:21:08.963347
2,mongodb-67c5456f4-d4bgv,"(172.17.0.9,)",2022-03-17T18:50:01.276773,2022-03-17T19:04:48.912214
3,mongodb-67c5456f4-ddhnf,"(172.17.0.16,)",2022-03-17T19:04:55.568221,2022-03-17T19:20:39.186520
4,mysql-6d778f4c8f-4bcr7,"(172.17.0.10,)",2022-03-17T18:50:01.452982,2022-03-17T19:20:41.989241
5,payment-5465d9cc79-8ln4b,"(172.17.0.14,)",2022-03-17T18:51:37.241520,2022-03-17T19:21:09.370607
6,rabbitmq-785b678f74-mhhtg,"(172.17.0.5,)",2022-03-17T18:51:01.009719,2022-03-17T19:21:09.582437
7,ratings-7ccf67b49f-6qckr,"(172.17.0.12,)",2022-03-17T18:50:02.401834,2022-03-17T19:20:40.260521
8,redis-0,"(172.17.0.15,)",2022-03-17T18:50:02.520990,2022-03-17T19:20:37.733984
9,shipping-7f6dfbf46f-94trr,"(172.17.0.11,)",2022-03-17T18:50:02.571352,2022-03-17T19:20:40.710991


We can see here that IP addresses are allocated to pods, i.e., if a new pod is respawned after the old one is killed (e.g., for the `mongodb-*` pod), it will receive a new IP address (old ones being reused later on, e.g., `172.17.0.9`).

We use this information to update our list of `podips` with the additional data found here.

In [21]:
for irow, row in df_rs.sort_values('ts_uts')[['pod.name', 'pod.internalip']].drop_duplicates().iterrows():
    for ip in row['pod.internalip']:
        podips.setdefault(ip, set()).add(row['pod.name'])
podips

{'172.17.0.10': {'mysql-6d778f4c8f-4bcr7'},
 '172.17.0.11': {'shipping-7f6dfbf46f-94trr'},
 '172.17.0.12': {'ratings-7ccf67b49f-6qckr'},
 '172.17.0.13': {'dispatch-69b65d89b9-4lgl7'},
 '172.17.0.14': {'payment-5465d9cc79-8ln4b'},
 '172.17.0.15': {'redis-0'},
 '172.17.0.4': {'catalogue-998b69bc9-bfnr7'},
 '172.17.0.5': {'rabbitmq-785b678f74-mhhtg'},
 '172.17.0.6': {'user-899b6c7ff-c7wnj'},
 '172.17.0.8': {'cart-7d7745696b-qgb99'},
 '172.17.0.9': {'mongodb-67c5456f4-d4bgv', 'web-77486f858f-jnf9r'},
 '172.17.0.16': {'mongodb-67c5456f4-ddhnf'},
 '172.17.0.7': {'user-899b6c7ff-qxd47'}}

### IPs for the robot-shop services

The information in the `pod.services` attribute shows us the services running in the robot-shop application and gives us details like IP address and port for each service:

In [22]:
table = []
for irow, row in df_rs.drop_duplicates(subset=['pod.name', 'pod.services_str']).iterrows():
    for service in row['pod.services']:
        # resolve portList x clusterIP
        for cip in service['clusterIP']:
            for port in service['portList']:
                svc = service.copy()
                del svc['portList']
                svc.update(port)
                del svc['clusterIP']
                svc['clusterIP'] = cip
                svc['pod.name'] = row['pod.name']
                # svc.update(row)
                table.append(svc)
df_services = pd.DataFrame(table)
df_services

Unnamed: 0,name,id,namespace,port,targetPort,nodePort,proto,clusterIP,pod.name
0,redis,81bc1a3e-f067-4b37-a077-54061050c1cb,robot-shop,6379,6379,0,TCP,10.111.214.104,redis-0
1,mongodb,09131df1-404c-4a02-bba2-ebe1e4393caa,robot-shop,27017,27017,0,TCP,10.109.105.252,mongodb-67c5456f4-d4bgv
2,user,f1686819-1201-4e1b-8d99-8579007104de,robot-shop,8080,8080,0,TCP,10.105.106.134,user-899b6c7ff-c7wnj
3,catalogue,d34b8f78-0865-4b54-a600-c011fe449545,robot-shop,8080,8080,0,TCP,10.96.58.129,catalogue-998b69bc9-bfnr7
4,shipping,e11145c8-1fe9-4f2d-a9a0-07d039d519ff,robot-shop,8080,8080,0,TCP,10.103.83.70,shipping-7f6dfbf46f-94trr
5,ratings,1f438794-721a-4641-8643-8964cba70095,robot-shop,80,80,0,TCP,10.104.84.135,ratings-7ccf67b49f-6qckr
6,mysql,d92b4e08-03bf-49a5-954b-026f7744482e,robot-shop,3306,3306,0,TCP,10.107.7.181,mysql-6d778f4c8f-4bcr7
7,rabbitmq,9a209f55-ed6f-4224-b6d3-2a056f9c7783,robot-shop,5672,5672,0,TCP,10.109.218.161,rabbitmq-785b678f74-mhhtg
8,rabbitmq,9a209f55-ed6f-4224-b6d3-2a056f9c7783,robot-shop,15672,15672,0,TCP,10.109.218.161,rabbitmq-785b678f74-mhhtg
9,rabbitmq,9a209f55-ed6f-4224-b6d3-2a056f9c7783,robot-shop,4369,4369,0,TCP,10.109.218.161,rabbitmq-785b678f74-mhhtg


Collect the high-level information for later identification in the observed network traffic:

In [23]:
services = {}
for irow, row in df_services.iterrows():
    services[(row['clusterIP'], row['port'])] = f"{row['name']}-{row['port']}"
services

{('10.111.214.104', 6379): 'redis-6379',
 ('10.109.105.252', 27017): 'mongodb-27017',
 ('10.105.106.134', 8080): 'user-8080',
 ('10.96.58.129', 8080): 'catalogue-8080',
 ('10.103.83.70', 8080): 'shipping-8080',
 ('10.104.84.135', 80): 'ratings-80',
 ('10.107.7.181', 3306): 'mysql-3306',
 ('10.109.218.161', 5672): 'rabbitmq-5672',
 ('10.109.218.161', 15672): 'rabbitmq-15672',
 ('10.109.218.161', 4369): 'rabbitmq-4369',
 ('10.108.220.250', 8080): 'payment-8080'}

# Understanding observed network traffic with cluster metadata

Let us take a look at the actually observed network traffic (i.e., the SysFlow NF records) for the robot-shop application, and collect this subset of SysFlow records into `df_rs_traffic`:

In [24]:
df_nf = df[df.type == 'NF']
df_rs_traffic = df_nf[df_nf['pod.ns']=='robot-shop'].groupby(['pod.name', 'net.sip', 'net.dip', 'net.dport']).agg({'flow.rops':'sum', 'flow.rbytes':'sum','flow.wops':'sum','flow.wbytes':'sum'}).reset_index()

... and compare these records with the knowledge about the IP address background that we gathered from the new cluster metadata attributes above (while adding also some 'well-known' background IP information manually):

In [25]:
types_source = []
types_destination = []
for irow, row in df_rs_traffic.iterrows():
    sip = row['net.sip']
    dip = row['net.dip']
    dport = row['net.dport']
    type_source = ''
    if sip == '0.0.0.0': type_source = '"localhost"'
    elif sip == '127.0.0.1': type_source = '"localhost"'
    elif sip == '172.17.0.1': type_source = '"docker network gateway"'
    elif podips.get(sip): type_source = f'POD {podips[sip]}'
    type_destination = ''
    if dip == '0.0.0.0': type_destination = '"localhost"'
    elif dip == '127.0.0.1': type_destination = '"localhost"'
    elif dip == '172.17.0.1': type_destination = '"docker network gateway"'
    elif podips.get(dip): type_destination = f'POD {podips[dip]}'
    elif dport == 42699: type_destination = '"instana agent"'
    elif sip == dip: 
            type_source = 'local'
            type_destination = 'local'
    else:
        service = services.get((dip, dport), '')
        if service != '': type_destination = f'SERVICE {service}'
        if dip == '10.96.0.10' and dport == 53: type_destination = '"cluster DNS"'
    
    types_source.append(type_source)
    types_destination.append(type_destination)
df_rs_traffic['type_source'] = types_source
df_rs_traffic['type_destination'] = types_destination

# for readability
for col in ('net.dport', 'flow.rops', 'flow.rbytes', 'flow.wops', 'flow.wbytes'):
    df_rs_traffic[col] = df_rs_traffic[col].apply(int)

In [26]:
df_rs_traffic[df_rs_traffic['flow.rbytes']>0] #[['pod.name', 'net.sip', 'net.dip', 'net.dport', 'type_source', 'type_destination']]

Unnamed: 0,pod.name,net.sip,net.dip,net.dport,flow.rops,flow.rbytes,flow.wops,flow.wbytes,type_source,type_destination
0,catalogue-998b69bc9-bfnr7,172.17.0.4,10.109.105.252,27017,359,108558,1069,18992,POD {'catalogue-998b69bc9-bfnr7'},SERVICE mongodb-27017
1,catalogue-998b69bc9-bfnr7,172.17.0.4,10.96.0.10,53,148,16136,148,6452,POD {'catalogue-998b69bc9-bfnr7'},"""cluster DNS"""
4,dispatch-69b65d89b9-4lgl7,172.17.0.13,10.109.218.161,5672,473,2469,123,1287,POD {'dispatch-69b65d89b9-4lgl7'},SERVICE rabbitmq-5672
5,dispatch-69b65d89b9-4lgl7,172.17.0.13,10.96.0.10,53,2779,180224,1408,77440,POD {'dispatch-69b65d89b9-4lgl7'},"""cluster DNS"""
9,mongodb-67c5456f4-d4bgv,172.17.0.1,172.17.0.9,27017,921,18316,344,104606,"""docker network gateway""","POD {'mongodb-67c5456f4-d4bgv', 'web-77486f858..."
10,mongodb-67c5456f4-ddhnf,172.17.0.1,172.17.0.16,27017,1026,19986,372,113133,"""docker network gateway""",POD {'mongodb-67c5456f4-ddhnf'}
11,mysql-6d778f4c8f-4bcr7,172.17.0.1,172.17.0.10,3306,4135,64937,1161,73956,"""docker network gateway""",POD {'mysql-6d778f4c8f-4bcr7'}
15,rabbitmq-785b678f74-mhhtg,127.0.0.1,127.0.0.1,4369,35,470,24,470,"""localhost""","""localhost"""
16,rabbitmq-785b678f74-mhhtg,172.17.0.1,172.17.0.5,5672,245,1287,236,2469,"""docker network gateway""",POD {'rabbitmq-785b678f74-mhhtg'}
17,rabbitmq-785b678f74-mhhtg,172.17.0.5,172.17.0.5,4369,39,375,27,375,POD {'rabbitmq-785b678f74-mhhtg'},POD {'rabbitmq-785b678f74-mhhtg'}


In this table we can see the summarized network flows inside the robot-shop application &mdash; how often, how many bytes have been read and written &mdash; in the context of the application, i.e., recognizing which pods and _services_ that are involved!

Small further remarks:
- If you look in detail, the above approach is not sufficient with respect to the reuse of IP addresses (i.e., this is a shortcoming of our `podips` set). In order to make this exact, one would have to keep track of the time information in addition, e.g., which IP address is used by a pod during what time interval.
- This shows the extension of the data to the cluster level, the `NF` records themselves still bind that as before to the lower-level details like the process involved etc.

# Summary

In this notebook, we have taken a first look at the new cluster metadata for Kubernetes/OpenShift clusters that is available with the recent SysFlow 0.5.0 release. 

In an experiment with Instana's robot-shop, we have seen this new cluster metadata at work and did an initial investigation into the collected data for that experiment, especially into the IP-related information newly available through the newly collected metadata.

We find that with the new data, the lower-level SysFlow data related to containers are put into the context of the cluster structure, namely pods, nodes and namespaces. 
The new `KE` records are in that respect complimentary to existing records, as they report data driven by _cluster events_ like the creation of a new pod. Conversily, the standard SysFlow records contain now the cluster metadata directly attached via the `pod.*` attributes.

With respect to observed IP addresses, the availability of the endpoint IPs/ports connected to services is especially interesting as it can be used, together with the IP information for the pods, to understand network flows internal to a cluster application.

Further extensions of these cluster-related metadata are planned, stay tuned!