The Problem
Observing distributed systems is not a trivial task. We have several machines running, each hosting workloads that, minding their own business, make network calls and print logs. That makes one think: “how on earth am I gonna keep it all together?”
One opinionated solution
Well, there’s a widely adopted tool that works as a central piece, gracefully turning chaos into an ordered flow. It is called OpenTelemetry Collector, and this article approaches it in an opinionated way.
Architecture overview
- One otel-collector runs as a DaemonSet to capture and process metrics, logs, and traces
- One trace-collector runs as a Deployment to reconstruct traces across different nodes
- Each node's collector gets its own subset of app metrics targets to scrape
- Metrics are scraped from Apps, Kubernetes, and the Host itself, and then forwarded to Victoriametrics
- Log files are tailed, and forwarded to Victorialogs
- Traces are received from app instrumentation, glued together in trace-collector, and forwarded to Victoriatraces.
- They are also parsed to calls_total and duration_milliseconds_.* metrics
Introduction
The collector has three main kinds of pieces responsible for the telemetry flow: Receivers, Processors, and Exporters. The opentelemetry-collector-contrib repository offers an abundance of all three.
These pieces are then orchestrated by a Pipeline and, for the traces <> metrics conversion, bridged via a Connector.
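To make that concrete, here is a minimal standalone collector config sketch (not one of this article's manifests) where the spanmetrics connector is listed as an exporter of the traces pipeline and as a receiver of the metrics pipeline:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  debug: {}
connectors:
  spanmetrics: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, debug] # the connector consumes the spans...
    metrics:
      receivers: [spanmetrics]        # ...and feeds the metrics pipeline
      processors: [batch]
      exporters: [debug]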
Installing the collector
The easiest way is via the operator, which installs several CustomResourceDefinitions. The one we are mainly interested in is OpenTelemetryCollector. If you run $ kubectl explain opentelemetrycollector --recursive, the sheer number of settings can be a bit scary, but we’ll cover the most relevant ones in this article.
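To get a feel for the resource, this is roughly the smallest useful OpenTelemetryCollector: a sketch that only receives OTLP and prints to the collector's own logs (the name is arbitrary). The full DaemonSet and Deployment manifests used in this article are in the appendix.

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: minimal # hypothetical name
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {} # just print what arrives to the collector's own logs
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]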
Telemetry pipelines
metrics:
  receivers:
    - hostmetrics
    - hostmetrics/disk # Get host metrics
    - kubeletstats # Get Kubernetes metrics
    - spanmetrics # Get parsed metrics from Traces
    - prometheus # Get scraped metrics from the Cluster's service monitors
  processors:
    - memory_limiter # Keep things cool
    - resource/instance # Set "instance" label so we know on which machine the daemonset pod has processed the telemetry
    - k8sattributes # Label metrics with info from the Kubernetes workload
    - transform/metrics # Sanitize unneeded labels
    - batch # Batch stuff so we don't DDoS the timeseries backend
  exporters:
    - prometheusremotewrite # Write to the metrics backend
logs:
  receivers:
    - filelog # Tail container log files on the host
    - otlp # Possibly receive other logs from apps' OpenTelemetry instrumentation
  processors:
    - memory_limiter # Keep things cool
    - transform/logs # Drop unwanted fields
    - batch # Don't DDoS the log storage backend
  exporters:
    - otlphttp/victoriametrics # Write to the log backend
traces:
  receivers:
    - otlp # Receive traces from apps' instrumentation
  processors:
    - memory_limiter # Keep things cool
    - k8sattributes # Enrich trace metadata
    - resource/instance # Say which node has processed the span
    - transform/spanmetrics # Parse spans to metrics
    - batch # Don't DDoS the trace storage backend
  exporters:
    - spanmetrics # Send newly-parsed metrics
    - loadbalancing # Send spans to trace-collector for gathering cross-node spans
The OpenTelemetry Collector has the features to make this happen; the components above are the pieces that achieve this telemetry flow. The next sections go through them per signal.
Metrics
These come from four different sources
- hostmetrics receiver
- Scrapes machine metrics such as CPU, memory, filesystem, and network
- kubeletstats receiver
- Scrapes Kubernetes metrics (nodes, pods, containers) from the kubelet's stats endpoint
- spanmetrics connector
- Receives metrics parsed from traces, e.g. calls_total, duration_milliseconds_.*
- prometheus receiver
- Scrapes metrics from the cluster's ServiceMonitors, with targets handed out by the Target Allocator
Are processed with
- memory_limiter processor
- Ensures the collector doesn’t blow past its memory limit and crash, at the expense of dropping data
- resource processor
- Labels the metrics arbitrarily; here, it sets the node name as the instance
- k8sattributes processor
- Labels the metrics with metadata from the Kubernetes workload (namespace, pod, container, etc.)
- transform processor
- Adds and removes metric labels, aiming to keep cardinality as low as possible
- batch processor
- As the name suggests, gathers metrics into batches before sending them, to save network calls
And forwarded to
- prometheusremotewrite exporter
- Sends metrics to Victoriametrics’ insert workload. In my cluster, prometheusremotewrite was noticeably more memory-friendly than otlphttp
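Because the target allocator is configured with serviceMonitorSelector: {} (see the appendix), getting an app's metrics scraped is just a matter of creating a ServiceMonitor for it. A sketch, assuming a hypothetical my-app Service that exposes a port named metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app # hypothetical app
  labels:
    app.kubernetes.io/name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app # must match the app's Service labels
  endpoints:
    - port: metrics # named port on the app's Service
      path: /metrics
      interval: 30s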
Logs
These can come from two origins
- filelog receiver
- Tails pod log files (assuming containerd as the CRI), parses them as JSON when applicable, and extracts resource metadata from the file path
- otlp receiver
- Lets apps send arbitrary logs that are not written to stdout/stderr
Are processed with
- memory_limiter processor
- Ensures the collector doesn’t blow past its memory limit and crash, at the expense of dropping data
- transform processor
- Aiming to keep storage as small as possible, drops everything we don’t need as a filterable field
- batch processor
- As above, batches records so we don’t DDoS the log storage backend
And forwarded to
- otlphttp exporter
- Sends to Victorialogs’ insert workload
Traces
These can come from a single origin
- otlp receiver
- App instrumentation reports traces via otlp to its node’s Collector, on the hostPort opened by the DaemonSet (see the pod sketch at the end of this section)
Are processed with
- The same memory_limiter, k8sattributes, resource/instance, and batch processors described above
- transform/spanmetrics processor
- Copies the namespace onto span attributes, so the metrics generated from spans carry it as a label
And forwarded to
- spanmetrics connector
- Which is also one of the metrics pipeline’s receivers
- loadbalancing exporter
- Routes spans by traceID, so all spans of a trace are gathered in the same trace-collector replica, which is necessary for tail sampling
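On the app side, pointing the SDK at the node-local collector is enough, since the DaemonSet opens hostPorts 4317/4318 on every node. A sketch using the downward API and the standard OTel SDK environment variables (my-app and its image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: my-app # hypothetical app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:latest # hypothetical image
      env:
        # Downward API: the IP of the node this pod landed on
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Standard OTel SDK env vars: OTLP over http on the DaemonSet's hostPort 4318
        # (or grpc on 4317); the same endpoint also accepts logs for the logs pipeline
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://$(NODE_IP):4318
        - name: OTEL_SERVICE_NAME
          value: my-app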
Appendix: Full manifests
otel-collector DaemonSet
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: otel
spec:
mode: daemonset
image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.140.1
resources:
requests:
cpu: 100m
memory: 448Mi
limits:
cpu: "2"
memory: 1536Mi
ports:
- name: grpc
port: 4317
hostPort: 4317
targetPort: 4317
- name: http
port: 4318
hostPort: 4318
targetPort: 4318
observability:
metrics:
enableMetrics: true
env:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
targetAllocator:
enabled: true
image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.138.0
resources:
requests:
cpu: 25m
memory: 128Mi
limits:
cpu: 250m
memory: 256Mi
allocationStrategy: per-node
prometheusCR:
enabled: true
scrapeInterval: 30s
serviceMonitorSelector: {}
volumes:
- name: varlogpods
hostPath:
path: /var/log/pods
volumeMounts:
- name: varlogpods
mountPath: /var/log/pods
config:
extensions:
health_check:
endpoint: ${env:POD_IP}:13133
# https://opentelemetry.io/docs/collector/components/receiver/
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
filelog:
include:
- /var/log/pods/*/*/*.log
include_file_name: false
include_file_path: true
retry_on_failure:
enabled: true
start_at: beginning
operators:
- id: parser-containerd
type: regex_parser
regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
timestamp:
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
parse_from: attributes.time
- id: parser-pod-info
parse_from: attributes["log.file.path"]
regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
type: regex_parser
# Handle line breaks
- type: recombine
is_last_entry: attributes.logtag == 'F'
combine_field: attributes.log
combine_with: ""
max_batch_size: 1000
max_log_size: 1048576
output: handle_empty_log
source_identifier: attributes["log.file.path"]
- field: attributes.log
id: handle_empty_log
if: attributes.log == nil
type: add
value: ""
- type: json_parser
parse_from: attributes.log
if: attributes.log matches "^\\{"
- type: add
field: attributes.instance
value: ${env:K8S_NODE_NAME}
- id: export
type: noop
hostmetrics:
collection_interval: 30s
root_path: /
scrapers:
cpu:
metrics:
system.cpu.time:
enabled: true
system.cpu.utilization:
enabled: true
system.cpu.physical.count:
enabled: true
memory:
metrics:
system.memory.usage:
enabled: true
system.memory.utilization:
enabled: true
system.memory.limit:
enabled: true
load:
cpu_average: true
metrics:
system.cpu.load_average.1m:
enabled: true
system.cpu.load_average.5m:
enabled: true
system.cpu.load_average.15m:
enabled: true
network:
metrics:
system.network.connections:
enabled: true
system.network.dropped:
enabled: true
system.network.errors:
enabled: true
system.network.io:
enabled: true
system.network.packets:
enabled: true
system.network.conntrack.count:
enabled: true
system.network.conntrack.max:
enabled: true
hostmetrics/disk:
collection_interval: 1m
root_path: /
scrapers:
disk:
metrics:
system.disk.io:
enabled: true
system.disk.operations:
enabled: true
filesystem:
metrics:
system.filesystem.usage:
enabled: true
system.filesystem.utilization:
enabled: true
kubeletstats:
collection_interval: 30s
auth_type: "serviceAccount"
endpoint: "https://${env:K8S_NODE_NAME}:10250"
insecure_skip_verify: true
collect_all_network_interfaces:
node: true
pod: true
prometheus:
target_allocator:
collector_id: ${env:POD_NAME}
endpoint: http://otel-targetallocator
interval: 30s
config:
scrape_configs:
- job_name: otel-collector
scrape_interval: 30s
static_configs:
- targets:
- ${env:POD_IP}:8888
# https://opentelemetry.io/docs/collector/components/processor/
processors:
memory_limiter:
check_interval: 1s
limit_percentage: 75
spike_limit_percentage: 15
batch:
send_batch_max_size: 2048
send_batch_size: 1024
timeout: 1s
k8sattributes:
auth_type: "serviceAccount"
passthrough: false
filter:
node_from_env_var: K8S_NODE_NAME
extract:
metadata:
- k8s.namespace.name
- k8s.deployment.name
- k8s.replicaset.name
- k8s.statefulset.name
- k8s.daemonset.name
- k8s.cronjob.name
- k8s.job.name
- k8s.node.name
- k8s.pod.name
- k8s.pod.ip
- k8s.container.name
- container.id
labels:
- tag_name: owner
key: app.kubernetes.io/owner
from: pod
pod_association:
- sources:
- from: resource_attribute
name: k8s.pod.ip
- sources:
- from: resource_attribute
name: k8s.pod.name
- from: resource_attribute
name: k8s.namespace.name
- sources:
- from: connection
resource/instance:
attributes:
# Sets "instance" label on metrics
- action: upsert
key: service.instance.id
value: ${env:K8S_NODE_NAME}
transform/logs:
error_mode: ignore
log_statements:
# Keep only essential fields
- statements:
- set(log.attributes["namespace"], resource.attributes["namespace"])
- keep_matching_keys(log.attributes, "^(_.*|@.*|filename|log|service|job|agent|k8s\\.|container_name|instance|level|msg|message|namespace|pod_name|severity|severity_text|stream)")
- conditions: IsMap(log.body)
statements:
- keep_matching_keys(log.body, "^(level|msg|message|severity|severity_text)$")
transform/metrics:
error_mode: ignore
metric_statements:
- statements:
- set(datapoint.attributes["env"], resource.attributes["k8s.cluster.name"])
- set(datapoint.attributes["owner"], resource.attributes["owner"]) where resource.attributes["owner"] != nil
- set(datapoint.attributes["namespace"], resource.attributes["k8s.namespace.name"]) where resource.attributes["k8s.namespace.name"] != nil and resource.attributes["k8s.namespace.name"] != "kube-system"
- set(datapoint.attributes["pod"], resource.attributes["k8s.pod.name"]) where resource.attributes["k8s.pod.name"] != nil
- set(datapoint.attributes["container"], resource.attributes["k8s.container.name"]) where resource.attributes["k8s.container.name"] != nil
# Normalize label names for kube-state-metrics, ingress-nginx, etc.
- set(datapoint.attributes["namespace"], datapoint.attributes["exported_namespace"]) where datapoint.attributes["exported_namespace"] != nil and resource.attributes["k8s.namespace.name"] != "kube-system"
- set(datapoint.attributes["service"], datapoint.attributes["exported_service"]) where datapoint.attributes["exported_service"] != nil
- set(datapoint.attributes["pod"], datapoint.attributes["exported_pod"]) where datapoint.attributes["exported_pod"] != nil
- set(datapoint.attributes["container"], datapoint.attributes["exported_container"]) where datapoint.attributes["exported_container"] != nil
- statements:
- delete_key(datapoint.attributes, "exported_namespace") where datapoint.attributes["exported_namespace"] != nil
- delete_key(datapoint.attributes, "exported_service") where datapoint.attributes["exported_service"] != nil
- delete_key(datapoint.attributes, "exported_pod") where datapoint.attributes["exported_pod"] != nil
- delete_key(datapoint.attributes, "exported_container") where datapoint.attributes["exported_container"] != nil
transform/spanmetrics:
error_mode: silent
trace_statements:
- statements:
- set(span.attributes["namespace"], resource.attributes["k8s.namespace.name"]) where resource.attributes["k8s.namespace.name"] != nil
- set(span.attributes["namespace"], resource.attributes["service.namespace"]) where resource.attributes["service.namespace"] != nil
connectors:
spanmetrics:
aggregation_cardinality_limit: 100000
dimensions:
- name: namespace
- name: http.route
- name: http.method
- name: http.status_code
exclude_dimensions:
- status.code
- span.name
- span.kind
- service.name # The "job" label usually carries the same value
histogram:
explicit:
buckets:
- 10ms
- 50ms
- 100ms
- 250ms
- 500ms
- 1s
- 2s
- 5s
metrics_expiration: 1m
metrics_flush_interval: 30s
namespace: ""
# https://opentelemetry.io/docs/collector/components/exporter/
exporters:
debug: {}
prometheusremotewrite:
endpoint: http://vmmetrics-insert.victoriametrics:8480/insert/0/prometheus
timeout: 30s
retry_on_failure:
enabled: true
initial_interval: 10s
max_interval: 60s
max_elapsed_time: 300s
otlphttp/victoriametrics:
compression: gzip
encoding: proto
logs_endpoint: http://vmlogs-insert.victoriametrics:9481/insert/opentelemetry/v1/logs
tls:
insecure: true
loadbalancing:
routing_key: traceID
resolver:
dns:
hostname: trace-collector
protocol:
otlp:
tls:
insecure: true
# https://opentelemetry.io/docs/collector/configuration/#service
service:
telemetry:
logs:
encoding: json
level: info
extensions:
- health_check
# https://opentelemetry.io/docs/collector/configuration/#pipelines
pipelines:
logs:
receivers: [filelog, otlp]
processors:
- memory_limiter
- transform/logs
- batch
exporters: [otlphttp/victoriametrics]
metrics:
receivers:
- hostmetrics
- hostmetrics/disk
- kubeletstats
- spanmetrics
- prometheus
processors:
- memory_limiter
- resource/instance
- k8sattributes
- transform/metrics
- batch
exporters:
- prometheusremotewrite
traces:
receivers: [otlp]
processors:
- memory_limiter
- k8sattributes
- resource/instance
- transform/spanmetrics
- batch
exporters:
- spanmetrics
- loadbalancing
otel-collector RBAC
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otel-collector
rules:
- apiGroups: [""]
resources:
- pods
- namespaces
- nodes
- nodes/metrics
- nodes/stats
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources:
- replicasets
- deployments
- statefulsets
- daemonsets
verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
resources:
- jobs
- cronjobs
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
resources:
- replicasets
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otel-collector
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: otel-collector
subjects:
- kind: ServiceAccount
name: otel-collector # Controller provisions the SA but not the ClusterRole
namespace: open-telemetry
trace-collector Deployment
# trace-collector
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: trace
spec:
mode: deployment
image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.140.1
autoscaler:
minReplicas: 2
maxReplicas: 6
targetCPUUtilization: 100
resources:
requests:
cpu: 100m
memory: 384Mi
limits:
cpu: 500m
memory: 1Gi
ports:
- name: grpc
port: 4317
targetPort: 4317
observability:
metrics:
enableMetrics: true
env:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
config:
extensions:
health_check:
endpoint: ${env:POD_IP}:13133
# https://opentelemetry.io/docs/collector/components/receiver/
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
prometheus:
config:
scrape_configs:
- job_name: trace-collector
scrape_interval: 30s
static_configs:
- targets:
- ${env:POD_IP}:8888
# https://opentelemetry.io/docs/collector/components/processor/
processors:
memory_limiter:
check_interval: 1s
limit_percentage: 75
spike_limit_percentage: 15
batch:
send_batch_max_size: 2048
send_batch_size: 1024
timeout: 1s
tail_sampling:
policies:
- name: drop_span
type: drop
drop:
drop_sub_policy:
- type: ottl_condition
name: sub-policy-0
ottl_condition:
error_mode: ignore
span:
- IsMatch(attributes["http.target"], "^(/health|/metrics|/ping|/ready)")
- name: keep_slow_requests
type: latency
latency:
threshold_ms: 1000
- name: keep_error_requests
type: numeric_attribute
numeric_attribute:
key: http.status_code
min_value: 400
max_value: 599
- name: keep_user_spans
type: ottl_condition
ottl_condition:
error_mode: ignore
span:
- attributes["user.id"] != nil and attributes["user.id"] != ""
- name: keep_1_percent_of_the_rest
type: probabilistic
probabilistic:
sampling_percentage: 1
# https://opentelemetry.io/docs/collector/components/exporter/
exporters:
debug: {}
otlphttp/victoriametrics:
compression: gzip
encoding: proto
traces_endpoint: http://vmtraces-insert.victoriametrics:10481/insert/opentelemetry/v1/traces
tls:
insecure: true
# https://opentelemetry.io/docs/collector/configuration/#service
service:
telemetry:
logs:
encoding: json
level: info
extensions:
- health_check
# https://opentelemetry.io/docs/collector/configuration/#pipelines
pipelines:
logs:
receivers: [otlp]
processors: []
exporters: [debug]
metrics:
receivers: [otlp]
processors: []
exporters: [debug]
traces:
receivers: [otlp]
processors:
- memory_limiter
- tail_sampling
- batch
exporters: [otlphttp/victoriametrics]
targetAllocator RBAC
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otel-targetallocator
rules:
- apiGroups: [""]
resources:
- pods
- services
- endpoints
- nodes
- nodes/metrics
- namespaces
- configmaps
verbs: ["get", "list", "watch"]
- apiGroups: ["discovery.k8s.io"]
resources:
- endpointslices
verbs: ["get", "list", "watch"]
- apiGroups: ["monitoring.coreos.com"]
resources:
- probes
- scrapeconfigs
verbs: ["get", "list", "watch"]
- apiGroups: ["monitoring.coreos.com"]
resources:
- servicemonitors
- podmonitors
verbs: ["*"] # targetAllocator throws a warning if this isnt permissive
- apiGroups: ["opentelemetry.io"]
resources:
- opentelemetrycollectors
verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs:
- /apis
- /apis/*
- /api
- /api/*
- /metrics
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otel-targetallocator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: otel-targetallocator
subjects:
- kind: ServiceAccount
name: otel-targetallocator # Controller provisions the SA but not the ClusterRole
namespace: open-telemetry