# Tail Sampling Scheme

The tail sampling processor samples traces according to a set of defined policies. However, all spans of a trace must be received by the same collector instance in order to make effective sampling decisions. Therefore, adjustments need to be made to the Global OpenTelemetry Collector architecture of Insight to implement tail sampling policies.

## Specific Changes

Introduce an OTel Collector with load-balancing (LB) capability in front of the Global OpenTelemetry Collector.

## Steps for Changes

### Deploy the OTel Collector Component with LB Capability

Refer to the following YAML to deploy the component.

??? note "Click to view deployment configuration"

    ```yaml
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: insight-otel-collector-lb
    rules:
      - apiGroups: [""]
        resources: ["endpoints"]
        verbs: ["get", "watch", "list"]
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: insight-otel-collector-lb
      namespace: insight-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: insight-otel-collector-lb
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: insight-otel-collector-lb
    subjects:
      - kind: ServiceAccount
        name: insight-otel-collector-lb
        namespace: insight-system
    ---
    kind: ConfigMap
    metadata:
      labels:
        app.kubernetes.io/component: opentelemetry-collector
        app.kubernetes.io/instance: insight-otel-collector-lb
        app.kubernetes.io/name: insight-otel-collector-lb
      name: insight-otel-collector-lb-collector
      namespace: insight-system
    apiVersion: v1
    data:
      collector.yaml: |
        receivers:
          otlp:
            protocols:
              grpc:
              http:
          jaeger:
            protocols:
              grpc:
        processors:
        extensions:
          health_check:
          pprof:
            endpoint: :1888
          zpages:
            endpoint: :55679
        exporters:
          logging:
          loadbalancing:
            routing_key: "traceID"
            protocol:
              otlp:
                # all options from the OTLP exporter are supported
                # except the endpoint
                timeout: 1s
                tls:
                  insecure: true
            resolver:
              k8s:
                service: insight-opentelemetry-collector
                ports:
                  - 4317
        service:
          extensions: [pprof, zpages, health_check]
          pipelines:
            traces:
              receivers: [otlp, jaeger]
              exporters: [loadbalancing]
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app.kubernetes.io/component: opentelemetry-collector
        app.kubernetes.io/instance: insight-otel-collector-lb
        app.kubernetes.io/name: insight-otel-collector-lb
      name: insight-otel-collector-lb
      namespace: insight-system
    spec:
      replicas: 2
      selector:
        matchLabels:
          app.kubernetes.io/component: opentelemetry-collector
          app.kubernetes.io/instance: insight-otel-collector-lb
          app.kubernetes.io/name: insight-otel-collector-lb
      template:
        metadata:
          labels:
            app.kubernetes.io/component: opentelemetry-collector
            app.kubernetes.io/instance: insight-otel-collector-lb
            app.kubernetes.io/name: insight-otel-collector-lb
        spec:
          containers:
            - args:
                - --config=/conf/collector.yaml
              env:
                - name: POD_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.name
              image: ghcr.m.daocloud.io/openinsight-proj/opentelemetry-collector-contrib:5baef686672cfe5551e03b5c19d3072c432b6f33
              imagePullPolicy: IfNotPresent
              livenessProbe:
                failureThreshold: 3
                httpGet:
                  path: /
                  port: 13133
                  scheme: HTTP
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              name: otc-container
              resources:
                limits:
                  cpu: '1'
                  memory: 2Gi
                requests:
                  cpu: 100m
                  memory: 400Mi
              ports:
                - containerPort: 14250
                  name: jaeger-grpc
                  protocol: TCP
                - containerPort: 8888
                  name: metrics
                  protocol: TCP
                - containerPort: 4317
                  name: otlp-grpc
                  protocol: TCP
                - containerPort: 4318
                  name: otlp-http
                  protocol: TCP
                - containerPort: 55679
                  name: zpages
                  protocol: TCP
              volumeMounts:
                - mountPath: /conf
                  name: otc-internal
          serviceAccount: insight-otel-collector-lb
          serviceAccountName: insight-otel-collector-lb
          volumes:
            - configMap:
                defaultMode: 420
                items:
                  - key: collector.yaml
                    path: collector.yaml
                name: insight-otel-collector-lb-collector
              name: otc-internal
    ---
    kind: Service
    apiVersion: v1
    metadata:
      name: insight-opentelemetry-collector-lb
      namespace: insight-system
      labels:
        app.kubernetes.io/component: opentelemetry-collector
        app.kubernetes.io/instance: insight-otel-collector-lb
        app.kubernetes.io/name: insight-otel-collector-lb
    spec:
      ports:
        - name: fluentforward
          protocol: TCP
          port: 8006
          targetPort: 8006
        - name: jaeger-compact
          protocol: UDP
          port: 6831
          targetPort: 6831
        - name: jaeger-grpc
          protocol: TCP
          port: 14250
          targetPort: 14250
        - name: jaeger-thrift
          protocol: TCP
          port: 14268
          targetPort: 14268
        - name: metrics
          protocol: TCP
          port: 8888
          targetPort: 8888
        - name: otlp
          protocol: TCP
          appProtocol: grpc
          port: 4317
          targetPort: 4317
        - name: otlp-http
          protocol: TCP
          port: 4318
          targetPort: 4318
        - name: zipkin
          protocol: TCP
          port: 9411
          targetPort: 9411
        - name: zpages
          protocol: TCP
          port: 55679
          targetPort: 55679
      selector:
        app.kubernetes.io/component: opentelemetry-collector
        app.kubernetes.io/instance: insight-otel-collector-lb
        app.kubernetes.io/name: insight-otel-collector-lb
    ```
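
After applying the manifests, it is worth checking that the load-balancing layer is up before changing any sampling configuration. The commands below are a minimal sketch, assuming `kubectl` access to the cluster; the namespace, labels, and resource names are taken from the manifests above.

```shell
# Both replicas of the LB collector should be Running
kubectl -n insight-system get pods -l app.kubernetes.io/instance=insight-otel-collector-lb

# This Service is the new OTLP entry point (gRPC on port 4317)
kubectl -n insight-system get svc insight-opentelemetry-collector-lb

# Inspect the LB collector logs for startup errors, e.g. RBAC issues that
# would prevent the k8s resolver from listing backend collector endpoints
kubectl -n insight-system logs deploy/insight-otel-collector-lb
```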

### Configure Tail Sampling Rules

!!! note

    Tail sampling rules need to be added to the existing `insight-otel-collector-config` configmap.
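
The edits in the steps below all go into that configmap. How you apply them depends on how Insight is managed (Helm values, GitOps, or direct edits); as a minimal sketch, assuming direct `kubectl` access and that the configmap lives in the `insight-system` namespace alongside the collector, it can be opened for editing like this:

```shell
# Open the Global collector configuration for editing;
# the tail_sampling processor and pipeline changes from the steps below go here
kubectl -n insight-system edit configmap insight-otel-collector-config
```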

1. Add the following content in the `processors` section, and adjust the specific rules as needed; refer to the [OTel official example](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md#a-practical-example).

    ```yaml
    ........
    tail_sampling:
      decision_wait: 10s # Wait for 10 seconds; traces older than 10 seconds will no longer be processed
      num_traces: 1500000 # Number of traces saved in memory; assuming 1000 traces per second, this should not be less than 1000 * decision_wait * 2;
                          # setting it too large may consume too much memory, setting it too small may cause some traces to be dropped
      expected_new_traces_per_sec: 10
      policies: # Reporting policies
        [
          {
            name: latency-policy,
            type: latency, # Report traces that exceed 500ms
            latency: {threshold_ms: 500}
          },
          {
            name: status_code-policy,
            type: status_code, # Report traces with ERROR status code
            status_code: {status_codes: [ ERROR ]}
          }
        ]
    ......
    tail_sampling: # Composite sampling
      decision_wait: 10s # Wait for 10 seconds; traces older than 10 seconds will no longer be processed
      num_traces: 1500000 # Number of traces saved in memory; assuming 1000 traces per second, this should not be less than 1000 * decision_wait * 2;
                          # setting it too large may consume too much memory, setting it too small may cause some traces to be dropped
      expected_new_traces_per_sec: 10
      policies:
        [
          {
            name: debug-worker-cluster-sample-policy,
            type: and,
            and: {
              and_sub_policy:
                [
                  {
                    name: service-name-policy,
                    type: string_attribute,
                    string_attribute: { key: k8s.cluster.id, values: [xxxxxxx] },
                  },
                  {
                    name: trace-status-policy,
                    type: status_code,
                    status_code: { status_codes: [ERROR] },
                  },
                  {
                    name: probabilistic-policy,
                    type: probabilistic,
                    probabilistic: { sampling_percentage: 1 },
                  }
                ]
            }
          }
        ]
    ```

2. Activate this `processor` in the `otel-col` pipeline:

    ```yaml
    traces:
      exporters:
        - servicegraph
        - otlp/jaeger
      processors:
        - memory_limiter
        - tail_sampling # 👈
        - batch
      receivers:
        - otlp
    ```

3. Restart the `insight-opentelemetry-collector` component (see the command sketch after this list).

4. When deploying the Insight-agent, modify the reporting address of the trace data to port `4317` of the `otel-col` LB Service:

    ```yaml
    ....
    exporters:
      otlp/global:
        endpoint: insight-opentelemetry-collector-lb.insight-system.svc.cluster.local:4317 # 👈 Modify to the LB address
    ```
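
Step 3 can be carried out with standard `kubectl` commands. The following is a minimal sketch, assuming the Global collector runs as a Deployment named `insight-opentelemetry-collector` in the `insight-system` namespace (adjust the workload kind and name to match your installation); it simply restarts the component so the new tail-sampling configuration is picked up.

```shell
# Restart the Global collector so it reloads insight-otel-collector-config
# (assumes it runs as a Deployment; use the actual workload kind/name in your cluster)
kubectl -n insight-system rollout restart deployment/insight-opentelemetry-collector

# Wait for the new pods to become ready
kubectl -n insight-system rollout status deployment/insight-opentelemetry-collector
```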