Troubleshooting: Different Valid Tracing IDs Are Mingled Together
Last week, users reported a weird issue. They found:
- The server received 10+ identical requests to the same API within a single trace, which should be impossible.
- When the server made another RPC call while serving a request, the egress logging interceptor reported two different tracing IDs: the first one was set on the context logger after parsing the trace ID from the client, and the second one was retrieved from the context. They should be the same.
- Data put inside the context (which should be propagated by the underlying SDK during RPC) did not reach the downstream server.
This issue is complicated because of the number of components involved and its cross-service nature. It took me a lot of time to troubleshoot, but I finally found the root cause: a user used `gin.Context` directly as a `context.Context` inside a spawned goroutine, while gin reuses the `gin.Context` for another request. As a result, the goroutine spawned by the previous request is affected by the new request, and the request IDs get mingled.
Background
Inside our company, we have an internal microservice system. For historical reasons, a huge number of HTTP services still serve behind the gateway and call each other. This prevents effective service governance because it is impossible to track the call graph of such a large system, and those services cannot enjoy platform-level features such as logging, tracing, monitoring, rate limiting, and so on.
Onboarding to the internal microservice system sounds great, but it requires a lot of human effort to change the codebase. To ease the migration, a trans-protocol solution bridging HTTP and the internal RPC system was proposed and designed for HTTP servers.
Gin framework and the outer wrapper
The gin framework could be run directly to listen on a port and serve requests. In our case, however, an additional outer wrapper is provided to integrate common infrastructure such as tracing, monitoring, and logging, along with the trans-protocol support. As a result, instead of calling `engine.Run`, we start another server on top of the user's HTTP server and then call `engine.ServeHTTP` to process each request.
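Below is a minimal sketch of this setup, assuming a plain `net/http` outer server; the route, handler, and port are made up for illustration and are not the real wrapper:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	engine := gin.New()
	engine.GET("/ping", func(c *gin.Context) {
		c.String(http.StatusOK, "pong")
	})

	// The outer wrapper owns the listener; gin only handles routing.
	wrapper := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ... restore the tracing span from headers, set up logging,
		// handle trans-protocol requests, etc. ...
		engine.ServeHTTP(w, r) // delegate to gin instead of calling engine.Run
	})

	http.ListenAndServe(":8080", wrapper)
}
```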
Diagnosing
Basic Assumptions
The troubleshooting took me a lot of time because I held the following assumptions, each of which should never break:
- The tracing span is restored by reading the HTTP header and is then set inside the context (a minimal sketch of this follows the list).
- The context passed in is immutable at the caller side.
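To make the first assumption concrete, here is a minimal sketch of such an ingress interceptor, assuming a hypothetical `X-Trace-Id` header and a private context key; the real system uses different header names and span types:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// traceIDKey is a hypothetical private key type for the restored trace ID.
type traceIDKey struct{}

// tracingMiddleware restores tracing information from the HTTP header and
// stores it in the request context before calling the next handler.
func tracingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get("X-Trace-Id") // hypothetical tracing header
		ctx := context.WithValue(r.Context(), traceIDKey{}, traceID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "trace id: %v\n", r.Context().Value(traceIDKey{}))
	})
	http.ListenAndServe(":8080", tracingMiddleware(handler))
}
```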
Given these two fundamental assumptions, it looked impossible for the issue to come from the context itself. Later you will see that the second assumption is actually wrong, because it depends heavily on the concrete implementation. But for now let's continue, since I only realized that later as well. I changed direction and checked whether mixing different tracing headers could cause the issue.
Mixed Usage of Different Tracing Headers
For backward compatibility, the tracing interceptor supports two kinds of headers, while the outer server supports only one of them.
Say the headers are `header_tracing_1` and `header_tracing_2`, and the server supports `header_tracing_1` only. My guess was that the context logger is set from the tracing span restored by the outer interceptor, while the context used when sending the RPC request is populated by the custom interceptor. In that case, the issue would be caused by mixing the two tracing headers.
However, this cannot explain why multiple requests appear in the same trace, and after checking the code, the assumption turned out to be false.
How to continue?
The most reasonable assumption was wrong, so I had to find another way. I had to read the code instead of asking users for details, because they were not aware of the issue themselves.
Then I found that the RPC call is made asynchronously while serving the request. This challenged my assumption that "the context passed in is immutable at the caller side", because whether it holds depends entirely on the implementation.
For example, if the caller stores a pointer instead of a value in the context, the caller can still modify the pointed-to data afterwards.
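Here is a minimal sketch of that case, with a hypothetical `traceInfo` type and context key:

```go
package main

import (
	"context"
	"fmt"
)

// traceInfo and ctxKey are hypothetical, for illustration only.
type traceInfo struct{ ID string }

type ctxKey struct{}

func main() {
	info := &traceInfo{ID: "trace-1"}
	ctx := context.WithValue(context.Background(), ctxKey{}, info)

	fmt.Println(ctx.Value(ctxKey{}).(*traceInfo).ID) // trace-1

	// The caller still holds the pointer, so the "immutable" context now
	// observes a different value.
	info.ID = "trace-2"

	fmt.Println(ctx.Value(ctxKey{}).(*traceInfo).ID) // trace-2
}
```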
However, the span context stored inside the context is a value, not a pointer. So we need to check whether the context itself (rather than the span context inside it) could change. This can happen when value retrieval is lazy, i.e. when it reads from an underlying resource at call time.
After checking the code, I found that users had passed the `gin.Context` as the `context.Context` during RPC. So the issue might lie in how `gin.Context` is used.
gin.Context.Value(): Retrieve the Value from the Request
`gin.Context.Value` retrieves the value from the underlying request via `c.Request.Context()`. Given that the RPC is made asynchronously, it might be the case that the `gin.Context` is reused and its `Request` field changes while another request is being served.
```go
// Value returns the value associated with this context for key, or nil
// if no value is associated with key. Successive calls to Value with
// the same key returns the same result.
func (c *Context) Value(key any) any {
	if key == ContextRequestKey {
		return c.Request
	}
	if key == ContextKey {
		return c
	}
	if keyAsString, ok := key.(string); ok {
		if val, exists := c.Get(keyAsString); exists {
			return val
		}
	}
	if !c.hasRequestContext() {
		return nil
	}
	return c.Request.Context().Value(key)
}
```
That sounds plausible, so let's check how the context is reused to confirm the assumption above. As `engine.ServeHTTP` shows, the `gin.Context` is returned to the pool when the request is done; in our case, however, the asynchronous RPC call still holds that context while waiting for the scheduler to run it.
```go
// ServeHTTP conforms to the http.Handler interface.
func (engine *Engine) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	c := engine.pool.Get().(*Context)
	c.writermem.reset(w)
	c.Request = req
	c.reset()
	engine.handleHTTPRequest(c)
	engine.pool.Put(c)
}
```
When the next request comes in, it reassigns the `Request` field of the recycled `gin.Context`, which changes what `gin.Context.Value()` returns inside the still-pending asynchronous RPC call. This silent side effect is harmful.
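Putting it together, here is a minimal sketch of the problematic pattern, with a hypothetical handler and a hypothetical `trace_id` key (the real code looks different):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
)

// buggyHandler captures the gin.Context in a goroutine that outlives the
// request, so by the time it reads values, gin may already have recycled
// the context for a different request.
func buggyHandler(c *gin.Context) {
	go func() {
		// Simulate the asynchronous RPC being scheduled after the handler returns.
		time.Sleep(100 * time.Millisecond)

		// c.Value falls through to c.Request.Context(); by now c.Request may
		// already point to another request's *http.Request.
		traceID, _ := c.Value("trace_id").(string) // "trace_id" is a hypothetical key
		log.Printf("async RPC sees trace id: %q", traceID)
	}()

	c.String(http.StatusOK, "accepted")
}

func main() {
	engine := gin.New()
	engine.GET("/work", buggyHandler)
	engine.Run(":8080")
}
```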
gin.Context Between 2 Requests and Spawned Goroutines
To help understand why this issue happens across two different goroutines, I drew the diagram below.
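For readers who prefer code, here is a small, self-contained simulation of the same interleaving. It does not use gin itself; the hypothetical `pooledContext` type merely stands in for the reused `gin.Context`:

```go
package main

import "fmt"

// pooledContext stands in for the pooled, reused gin.Context.
type pooledContext struct {
	TraceID string
}

func main() {
	c := &pooledContext{}           // one object handed out by the pool
	recycled := make(chan struct{}) // signals that request 2 has reused the object
	done := make(chan struct{})

	// Request 1: set its trace ID and spawn an asynchronous "RPC".
	c.TraceID = "trace-1"
	go func() {
		<-recycled // the scheduler happens to run this only after the reuse
		fmt.Println("async call from request 1 sees:", c.TraceID) // prints "trace-2"
		close(done)
	}()

	// Request 2: the pool hands out the same object and overwrites its fields.
	c.TraceID = "trace-2"
	close(recycled)

	<-done
}
```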
Conclusion
Next time you see multiple requests mixed together in a single trace, the first thing to check is whether a context is being reused incorrectly. More generally, whenever information from different requests gets mixed up, reviewing the object-reuse logic is a good starting point.
The fix is simple: pass `gin.Context.Request.Context()` to the goroutine, or use `gin.Context.Copy()`, so that the context used asynchronously is uniquely owned by one request.
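A minimal sketch of both options, again with a hypothetical handler and key:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func fixedHandler(c *gin.Context) {
	// Option 1: hand the goroutine the request-scoped context.Context directly.
	ctx := c.Request.Context()

	// Option 2: hand the goroutine a read-only copy of the gin.Context.
	cCopy := c.Copy()

	go func() {
		_ = ctx.Value("trace_id")   // unaffected by gin's context pooling
		_ = cCopy.Value("trace_id") // Copy() detaches the context from the pool
		// ... make the asynchronous RPC with ctx or cCopy ...
	}()

	c.String(http.StatusOK, "accepted")
}

func main() {
	engine := gin.New()
	engine.GET("/work", fixedHandler)
	engine.Run(":8080")
}
```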
The gin docs don't say anything about this possible issue. It's quite a subtle pitfall :(