[for devs] New logging infrastructure #8457
jmazzitelli started this conversation in General
Reply: See this PR for new docs that are coming that discuss analyzing the new logs: kiali/kiali.io#881
This post is mainly for Kiali developers.
TL;DR
I wanted to give some details of the new logging infrastructure in use inside the Kiali Server. In the past we used our simple, easy-to-use `log.Debug` API (which we can still use today; it is not going away), but it does not let us attach much detail to our logs. We are going to start using structured logging, and use it to track logs across requests, via the zerolog API that we already have (but never fully utilized). See the "How to use the new logger" section below for the important part.
What we used in the past
In the past, logging messages in Kiali has been simple and straightforward -- using `log.Debugf()` and related functions (e.g. `log.Tracef()`, `log.Info()`, etc.). Under the covers, this uses the zerolog logger, but all it allowed us to do was log a string message at a specific level (debug, info, etc.). At most we were able to build the string dynamically using the "f" versions of the API (which are essentially Sprintf-style expressions).

What we can use today
Today we are exposing more of the zerolog logger API to get better structured logging, along with being able to trace logs throughout an individual request. This is going to make it easier to debug and diagnose problems within the server.
What this means is that we will now have individual zerolog loggers given to our HTTP handlers, with those loggers attached to the HTTP request objects passed to the handlers. That logger can then be passed around inside `context.Context` objects. These are actual zerolog loggers, so you need to use the zerolog API to log messages (which is not horrible, but slightly different from the simple `log.Debugf` we have been using).

How to use the new logger
Here's the important part for devs to know. To use this new logger infrastructure, you now have to obtain the logger in some way - usually either from a context within an `http.Request` (i.e. `http.Request.Context()`) or from a `context.Context` argument passed to a function. Here's an example in our graph code of how to get the context from a request. Notice it simply gets the context from the request and passes it down to another function - that function will then use the context to obtain the logger so it can log messages.
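A minimal sketch of that pattern (the handler and function names here are hypothetical stand-ins, not Kiali's actual source):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// graphNamespacesHandler stands in for the real graph handler: it only
// grabs the context off the incoming request (a request-scoped zerolog
// logger was attached to that context upstream) and passes it down.
func graphNamespacesHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	generateGraph(ctx, w)
}

// generateGraph stands in for the called API function: it receives the
// context and would use it to obtain the logger for its log messages.
func generateGraph(ctx context.Context, w http.ResponseWriter) {
	fmt.Fprintf(w, "graph generated (ctx present: %v)", ctx != nil)
}

// runDemo exercises the handler with a synthetic request.
func runDemo() string {
	req := httptest.NewRequest("GET", "/api/namespaces/graph", nil)
	rec := httptest.NewRecorder()
	graphNamespacesHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(runDemo())
}
```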
At that point, `api.GraphNamespaces` has a context that contains the logger it should use to log messages. That function should pass the context to other functions it calls, so those other functions can use the same logger to log messages. Here's an example of how to get the logger from a context, and then use that logger to log a message.
Notice the slightly different API used to log messages. Rather than the global `log.Tracef()` function we have been using, you now use the zerolog API directly. First call the log level method (in this case `Trace()`) and then call `Msgf()`, passing in the same dynamic Sprintf-style expressions we have already been using. So `log.Tracef()` should now be `zl.Trace().Msgf()`. Slightly different, but not much, so we can easily refactor our logging code when switching over to the new logging API.

NOTE ABOUT THE VARIABLE NAME zl: I use the variable name `zl` (rather than `log`) because it makes it easy to visually see where code has been refactored to use the new logging API. It is much easier to see `zl` in the code and immediately know, "Ah, this is already using zerolog and the new logger stuff". Seeing `log` as a variable name as well as the package name would be confusing and make the devs think too hard :) It's just a little thing that makes it slightly easier when refactoring and adding this code.

What do the logs look like now (i.e. "Why are we doing this?")
Not only does this add structured data to the logs (providing easier ways to search and filter log messages), but it also allows us to track log messages for a particular request, allowing for easier debugging and diagnosis of problems.
Here's an example of what a log message now looks like for the graph generation. (Note: I have enabled JSON logging (`spec.deployment.logger.log_format=json`) because it is easier to read the data for documentation, but everything works basically the same with our default text format; you just use grep expressions rather than jq to filter and search the logs.)

Notice an important piece of data we now have: `request-id` (this log came from the request identified as `d0or9k4dbm7c739fv8jg`). If I then filter my logs by searching for that request-id, I can follow that graph generation request from the handler all the way down to the appenders and Prometheus queries, and see what logs were generated just for that request. Below are only some of the logs so you get the idea; I did not show a bunch of messages here so I don't spam this thread as much as I already have ;).

```shell
kubectl logs -n istio-system deployments/kiali | jq -rR 'fromjson? | select(."request-id" == "d0or9k4dbm7c739fv8jg")'
```

Here you see the entire graph generation for this particular request took 17.66ms. You can see what Prometheus queries were performed to build this specific graph, and how long each of those queries took.
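grep can follow a single request in either log format; a quick sketch of the mechanics (the sample log lines below are fabricated for illustration only):

```shell
# Stand-in for `kubectl logs -n istio-system deployments/kiali` output
# (these lines are made up; they are not real Kiali log output).
logs='{"level":"trace","request-id":"d0or9k4dbm7c739fv8jg","message":"graph generation started"}
{"level":"debug","request-id":"zzzzzzzzzzzzzzzzzzzz","message":"some other request"}'

# Follow a single request by grepping for its request-id:
echo "$logs" | grep 'd0or9k4dbm7c739fv8jg'
```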
What to do now?
Only parts of the server code have been converted to use this new logging infrastructure. The graph generation code and some of the validation code have been converted because those have the most impact on performance (and thus are the places where this new data matters most). Other parts of the code have been converted, too - like the Prometheus and Tracing code.
We can still use the original `log.Debugf()` APIs. Those still work. You just won't get all the nice request-level logging. If you find code where you think it will be helpful to have that, you can convert it to the new logging API at that time.