Input defaults: $metrics$ = (sourcetype=splunkd OR sourcetype=metrics).

Base search (per-forwarder summary of the group=tcpin_connections metrics reported by the targets):

(index=core_splunk_internal OR index=_internal) sourcetype=splunkd $metrics$ TERM(group=tcpin_connections) $selected_targets$
| stats sum(kb) as sum_kb avg(chan_new_kBps) as avg_chan_new_kBps max(tcp_KBps) as max_tcp_KBps stdev(tcp_KBps) as stdev_tcp_KBps values(connectionType) as connectionType values(arch) as arch values(version) as version values(fwdType) as fwdType values(ssl) as ssl values(os) as os values(guid) as guid dc(guid) as guid_count dc(sourceIp) as count_sources dc(host) as indexer_count by hostname
| addinfo
| table *

Post-process that ranks forwarders by volume and by peak speed and builds the cumulative coverage fields:

| eventstats sum(max_tcp_KBps) as total_sum_avg_KBps stdev(tcp_KBps) as avg_stdev_KBps sum(sum_kb) as total_sum_kb dc(guid) as all_forwarders max(indexer_count) as all_indexers
| eval target_coverage=indexer_count."/".all_indexers, target_coverage_pct=indexer_count/all_indexers
| sort 0 - sum_kb
| streamstats sum(sum_kb) as accumlated_sum_kb count as ranking_most_data_kb by all_forwarders
| eval coverage_kb=accumlated_sum_kb/total_sum_kb, progress_through_forwarders_kb=(ranking_most_data_kb/all_forwarders) * 100
| sort 0 - max_tcp_KBps
| streamstats sum(max_tcp_KBps) as accumlated_avg_kbps count as ranking_most_data_kbps by all_forwarders
| eval coverage_kbps=accumlated_avg_kbps/total_sum_avg_KBps, progress_through_forwarders_kbps=(ranking_most_data_kbps/all_forwarders) * 100
| table hostname guid guid_count *

Token expressions attached to this search: $result.info_max_time$-$result.info_min_time$, $result.all_indexers$, round($result.total_sum_kb$/(1024*1024),2), round(($total_sum_gb$*1024)/$duration$,2).

Detailed metrics search for the selected forwarders:

$metrics$ TERM(group=tcpin_connections) $selected_forwarders$ $selected_targets$
| table _time host hostname guid guid_count kb tcp_Bps tcp_KBps tcp_avg_thruput tcp_Kprocessed tcp_eps process_time_ms chan_new_kBps evt_misc_kBps evt_raw_kBps evt_fields_kBps evt_fn_kBps evt_fv_kBps evt_fn_str_kBps evt_fn_meta_dyn_kBps evt_fn_meta_predef_kBps evt_fn_meta_str_kBps evt_fv_num_kBps evt_fv_str_kBps evt_fv_predef_kBps evt_fv_offlen_kBps evt_fv_fp_kBps

Post-process that filters the forwarder table to the range selected in the chart and builds the display columns:

| search (progress_through_forwarders_kbps>=$selected_min$, progress_through_forwarders_kbps<=$selected_max$) OR ( progress_through_forwarders_kb>=$selected_min$, progress_through_forwarders_kb<=$selected_max$)
| rename target_coverage_pct as "%"
| rename target_coverage as "indexer coverage"
| rename ranking_most_data_kbps as "speed ranking"
| rename ranking_most_data_kb as "volume ranking"
| rename max_tcp_KBps as "max speed"
| eval type=if(fwdType="full","HWF","UF")
| eval "%"='%'*100, "speed variability" = (stdev_tcp_KBps/'max speed')*100, "data"=sum_kb/1024/1024, "data %"= (sum_kb/total_sum_kb) * 100
| table hostname guid guid_count "volume ranking" "speed ranking" "indexer coverage" "%" "max speed" "speed variability" "data" "data %" "os" "type" "version" arch
| sort 0 + "volume ranking"

Annotation search for blocked tcpin queues:

| tstats count where $metrics$ $selected_targets$ TERM(blocked=true) TERM(group=queue) TERM(name=tcpin*) by host _time span=$seconds_for_bin$sec
| table _time host count
| eval annotation_label="Blocked input", annotation_category=host

Search that builds the host filter ($result.hosts$) from the current results:

| stats values(hostname) as hosts
| eval hosts="host IN (".mvjoin(hosts,", ").")"

Search that enumerates the targets found with the current filter:

| tstats count where $metrics$ $selected_targets$ by host
| streamstats count
Cluster selector populating search (builds a label and a host filter for each index cluster):

index=_internal INFO TERM(instance_roles=*) (cluster_master OR indexer) sourcetype=splunkd TERM(group=instance)
| fields host instance_roles index_cluster_label
| eval cluster_master=if(like(instance_roles,"%cluster_master%"),1,0), indexer=if(like(instance_roles,"%indexer%"),1,0)
| stats values(eval(if(cluster_master=1,host,""))) as cluster_master values(eval(if(indexer=1,host,""))) as indexer by index_cluster_label
| rex field=cluster_master "^(?<short_name_cm>[^\.]+)"
| eval label=short_name_cm." (".index_cluster_label.")", search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"
| where isnotnull(label)
| table label search

Bin size calculation, derived from the selected time range and the time resolution input (Crude / Low / Medium / High / Ultra):

if((round(relative_time(now(), $time.latest$)-relative_time(now(), $time.earliest$))/$time_resolution$)<31,31,round((relative_time(now(), $time.latest$)-relative_time(now(), $time.earliest$))/$time_resolution$))

Please select a cluster, site or host

What is this dashboard and how do I use it?

When a forwarder connects to a target, that target reports details about the connected forwarder. This information includes many interesting data points, including the version, the operating system, how much data has been sent and the average data rate.

This dashboard aggressively data mines the indexers for this information about their connecting forwarders and provides tooling that lets you explore and visualize the forwarder pool and how it interacts with the indexers. It is specifically built to identify problematic forwarders, with a focus on “super-giant forwarders”.
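Under the hood this relies on the group=tcpin_connections metrics that splunkd writes to metrics.log on each receiving indexer. A minimal sketch of that kind of search, assuming the metrics are searchable in index=_internal (the dashboard's $metrics$ and $selected_targets$ tokens narrow this further):

index=_internal sourcetype=splunkd TERM(group=tcpin_connections)
| stats latest(version) as version latest(os) as os latest(arch) as arch latest(fwdType) as fwdType sum(kb) as sum_kb max(tcp_KBps) as max_tcp_KBps dc(host) as indexers_seen by hostname guid
| sort 0 - sum_kb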

It is surprisingly common for 1% of the forwarder pool to be sending more than 50% of all the data into a cluster. This can be highly problematic, as these forwarders concentrate an excessive amount of data onto whichever indexers they are connected to at any moment. They are often referred to as “laser beams of death” that sweep the cluster, causing ingestion hotspots, event distribution problems, slow search times, event delay and instability. It is very important that these forwarders are configured to minimise these effects.

The top chart visualises the entire population of forwarders on the x-axis, from 0% of the pool to 100% of the pool. The y-axis shows the cumulative share of all data sent into the cluster by those forwarders, again from 0% of the data to 100% of the data. The resulting curve describes how data is received into the cluster. If every forwarder sent exactly the same amount of data (perfect distribution) we would have a straight line cutting across the chart. The greater the deviation from that straight line, the greater the asymmetry between forwarders.
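A simplified sketch of how that curve is computed, using the same approach as the dashboard's base search (ranking by volume only; the dashboard also builds the equivalent curve ranked by peak speed):

index=_internal sourcetype=splunkd TERM(group=tcpin_connections)
| stats sum(kb) as sum_kb by hostname
| eventstats sum(sum_kb) as total_sum_kb dc(hostname) as all_forwarders
| sort 0 - sum_kb
| streamstats sum(sum_kb) as accumulated_sum_kb count as ranking
| eval progress_through_forwarders_pct=round((ranking/all_forwarders)*100,1), coverage_pct=round((accumulated_sum_kb/total_sum_kb)*100,1)
| table progress_through_forwarders_pct coverage_pct hostname

Plotting coverage_pct against progress_through_forwarders_pct gives the curve; the diagonal from (0,0) to (100,100) is the perfect-distribution line.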

Select a range in the chart and the table below will be filtered. For instance, if you select the first 5% in the chart you will get the top 5% of forwarders by received data volume or by peak speed. This interaction lets you quickly filter to the most important forwarders and check whether they are configured to work well.

Targets received $total_sum_gb$ GB of data from connecting forwarders over $duration$ seconds = $average_data_rate_ps$ MB/s. The selected range covers from $selected_min$% to $selected_max$% ($selected_forwarders_pct$%) of all forwarders; in aggregate they send approximately $selected_volume_pct$% (~$selected_volume_abs$ GB) of the data received.

Chart post-process (bins the ranked forwarders and plots cumulative volume and speed coverage):

| bin progress_through_forwarders_kb bins=1000 start=0 end=100
| bin progress_through_forwarders_kbps bins=1000 start=0 end=100
| appendpipe [| stats max(coverage_kb) as coverage by progress_through_forwarders_kb all_forwarders | sort + progress_through_forwarders_kb | fields progress_through_forwarders_kb coverage | rename progress_through_forwarders_kb as progress_through_forwarders | eval column="volume"]
| appendpipe [| stats max(coverage_kbps) as coverage by progress_through_forwarders_kbps all_forwarders | sort + progress_through_forwarders_kbps | fields progress_through_forwarders_kbps coverage | rename progress_through_forwarders_kbps as progress_through_forwarders | eval column="speed"]
| where isnotnull(column)
| rex field=progress_through_forwarders "(?<sort>\d+)-"
| chart values(coverage) by progress_through_forwarders column
| eval "volume %"=round(volume*100,1), "speed %"=round(speed*100,1)
| fields - volume speed

Selection token expressions from the chart drilldown: substr($start$,1,len(rtrim($start$,"0123456789."))-1), substr($end$,-(len($end$)-len(rtrim($end$,"0123456789.")))), round($selected_max$-$selected_min$,1), round((($end.volume %$/100-$start.volume %$/100))*$total_sum_gb$,0), round($end.volume %$-$start.volume %$,2).
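The headline rate is a simple unit conversion; the dashboard computes round(($total_sum_gb$*1024)/$duration$,2). A worked example with hypothetical numbers: 500 GB received over a 14,400-second (4-hour) window gives (500*1024)/14400 ≈ 35.56 MB/s.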
Variable Description
hostname The hostname of the forwarder as reported by the receiving indexer. This might not be the same as the DNS hostname, and might not match the “host” value written into the index.
guid The GUID of the forwarder. Sometimes this has multiple values; I don't know what causes this. Maybe the GUID is changing, or maybe there are multiple forwarders with the same name.
volume ranking How the forwarder ranks by data volume; #1 is the forwarder that sent the most data over the duration of the report.
speed ranking How the forwarder ranks by peak ingestion rate; #1 is the forwarder with the highest ingestion peak. These forwarders might be suffering event delay if they have saturated the ingestion queue of the indexers.
indexer coverage How many of the indexers in the pool the forwarder connected to during the report window. This needs to happen quickly to get good event distribution, and the bigger the indexing cluster, the longer it takes to sweep. Ideally every forwarder should sweep the entire pool within 15 minutes to get well-balanced search performance for small search windows.
% The percentage form of indexer coverage. It is normalized so that values less than 100% can be highlighted.
max speed The max speed for the forwarder sending data.
speed variability The variation in speed generated by the forwarder: a low value implies a nice constant speed, a high value implies variable data rates. Forwarders with high variation should be configured with autoLBVolume so that the switching rate increases as data rates increase (see the outputs.conf sketch after this table).
Data The total amount of data (in GB) sent by the forwarder into the pool.
data % The data sent by the forwarder as a percentage of the total ingestion.
OS The operating system of the forwarder, typically Linux or Windows
Type The type of the forwarder: UF (universal forwarder) or HWF (heavy forwarder).
Version The version of the forwarder; some versions have bugs, and some lack features.
Arch The architecture of the CPU for the forwarder
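Where the speed variability column flags a forwarder, the usual mitigation is to make load balancing switch on volume as well as on time. A minimal outputs.conf sketch, assuming a standard tcpout group; the stanza name, server list and thresholds below are illustrative only and should be tuned for your environment:

# outputs.conf on the forwarder (example values only)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# switch indexer at least every 30 seconds...
autoLBFrequency = 30
# ...and also after roughly 10 MB has gone to the current indexer,
# so busy forwarders rotate through the pool faster
autoLBVolume = 10485760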
Showing $selected_forwarder_count$ of $total_forwarder_count$ forwarders

Filter dropdown populating searches (version, type, OS and indexer coverage, each with instance counts):

| stats count by version | eval label=version." (".count." instances)"
| stats count by type | eval label=type." (".count." instances)" | sort - count
| stats count by os | eval label=os." (".count." instances)" | sort - count
| stats count by "indexer coverage" | rename "indexer coverage" as coverage | eval label=coverage." (".count." instances)" | sort - count

For more detail, click on some forwarders and more charts will open, allowing you to compare the performance and behaviours of those you selected.

Post-process that applies the filters to the forwarder table:

| search type=$selected_forwarder_type$ os=$selected_forwarder_os$ version=$selected_forwarder_version$ "indexer coverage"="$selected_forwarder_coverage$" hostname=$filter_hostname$
| rename guid_count as "# guids"

Selection token expressions: mvdedup(mvappend($selected_forwarder$,$row.hostname$)) and "hostname=".mvjoin($selected_forwarder$," OR hostname=").
Drill down to intermediate forwarders

Reopen this dashboard with selected hosts as targets

Drill down to other dashboards for further analysis:

| stats count by hostname | eval debug_ingestion=hostname, event_delay_for_host=hostname | fields - count hostname

Drilldown target: $click.name2$?form.selected_host=$click.value$&form.time.earliest=$time.earliest$&form.time.latest=$time.latest$
Metrics for selected forwarders. Metric picker ($selected_metric$, default tcp_avg_thruput): kb tcp_Bps tcp_KBps tcp_avg_thruput tcp_Kprocessed tcp_eps process_time_ms chan_new_kBps evt_misc_kBps evt_raw_kBps evt_fields_kBps evt_fn_kBps evt_fv_kBps evt_fn_str_kBps evt_fn_meta_dyn_kBps evt_fn_meta_predef_kBps evt_fn_meta_str_kBps evt_fv_num_kBps evt_fv_str_kBps evt_fv_predef_kBps evt_fv_offlen_kBps evt_fv_fp_kBps

Is channel creation healthy for each forwarder? Excessive channel creation can cause throughput problems on the indexers.

| timechart limit=0 span=$seconds_for_bin$sec max($selected_metric$) by hostname

The percentage of the cluster the forwarder sent to over time. Aggregator input ($count_aggregator$) options: "Count distinct hosts" (dc(host) as count) and "Count". How did each forwarder connect to the cluster over time? Did it connect to every indexer in the cluster? How long did it take to sweep all indexers?

| bin span=$seconds_for_bin$sec _time
| sort 0 + hostname _time
| streamstats $count_aggregator$ by hostname
| xyseries _time hostname count
| addinfo
| eval time_mins = round((_time - info_min_time) / 60,0), all_targets=$total_indexer_count$
| fields - info_* _time
| table time_mins all_indexers *

$targest_found$ Targets found with filter - click to drill down into debug ingestion. Column count selector ($selected_no_columns$): 1-8, default 5.

$selected_targets$
| eval count=count-1, column=(count)% $selected_no_columns$, row=floor(count / $selected_no_columns$)
| xyseries row column host
| sort row
| fields - row

Drilldown target: debug_ingestion?form.selected_host=$click.value$&form.time.earliest=$time.earliest$&form.time.latest=$time.latest$
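To eyeball channel-creation behaviour outside this dashboard, a minimal sketch of the same idea, assuming the metrics are in index=_internal and using chan_new_kBps (from the same tcpin_connections metrics) as the channel-related rate:

index=_internal sourcetype=splunkd TERM(group=tcpin_connections)
| timechart limit=0 span=5m max(chan_new_kBps) by hostname

Forwarders with values that are persistently high relative to their peers are the ones this panel is meant to surface.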