
Go: log.Fatalf is Bad for Framework Maintainers

In an open source framework, as long as maintainers can confirm an issue is caused by users rather than by the framework itself, they can usually offer some guiding suggestions and avoid delving into users' code.

However, for the maintainer of a company-internal framework, things differ greatly. Since you claim your framework helps users write better code on top of the company platforms, they are likely to rely on the maintainers to figure out weird issues whenever anything seems related to the framework.

This blog records how I helped our framework users troubleshoot a weird issue, in which a service kept restarting and failed to deploy because of log.Fatalf, and why I think log.Fatalf is bad.

Why Is log.Fatalf Bad for Framework Maintainers?

I will state my position first, and then use the trouble caused by log.Fatalf, together with an unexpected log-write failure, to back it up.

From a framework maintainer's perspective, the framework aims to manage users' business logic. That's why our framework uses a toolchain to generate and control the main function.

In other words, users should provide the computation, and the framework should schedule and invoke that computation, handling errors correctly with proper logs. If a computation encounters an error and cannot proceed, it just needs to return the error and let the framework resolve it.

For example, a computation might be allowed to fail without necessarily exiting the whole process. By moving this policy from the user's computation into the framework, users get such support easily. However, log.Fatalf breaks this flexibility: it calls os.Exit directly, and the framework has no mechanism to recover from that the way it can recover from a panic.

For the framework, this means a computation can bypass the framework and exit the whole process directly, leaving the framework maintainers in great trouble when checking errors. Because the framework controls the main function, we don't want users' computations to be able to shut down the whole process on their own. The situation deteriorates when the log fails to be written to its writer, as in the real case below.

In short, a framework author should consider log.Fatalf (and any function that calls os.Exit) harmful and advise users not to use it.

The Appearance of Deployment Failure

One of our framework users reported that his service kept restarting and finally failed to deploy due to a timeout, so I was pulled in to help. After applying for access to the live service, I found no useful logs :( The error logs from our framework had disappeared, and all I knew was the exit code 1.

What's worse, it was a live-only issue. The test environment didn't have it at all, and even the live-like staging environment couldn't reproduce the error. Usually, given the isolation between live and test, a configuration difference causes this kind of issue. However, that is very business-specific logic, and unluckily I knew nothing about it, of course.

Troubleshooting: How I Finally Solved the Issue

Checking the Log and Exit Code, but the Hints Aren't Enough

Because our framework always logs while exiting, it was unlikely that the process had exited through our framework normally. So I suspected the users were calling something like log.Fatalf directly inside their codebase, and I did find some log.Fatalf calls there.

However, log.Fatalf prints an error to stderr, and there was no such error message, so log.Fatalf didn't look like the root cause either. At that point, I could only suggest that the users replace their log.Fatalf calls with returning an error to our framework, which would then handle the error properly and eventually exit for them.

That hint wasn't strong enough to persuade the users, so I needed to check again. The exit code might be helpful, so I checked our framework's exit code. However, both log.Fatalf and our framework exit with code 1, so I couldn't even tell whether the exit code came from the framework or from the users' code. Note that the framework logs a lot before exiting, but what if those logs were lost?

Hence, I learned a painful lesson to act on in the future:

  • as a framework that manages users' main function (yes, the main function is removed when using our framework), it's better to define a distinct set of exit codes to avoid overlapping with popular exit codes.

To verify this, I changed our framework's exit code, updated the user's service, and deployed again. Then I could be sure the exit was caused by the user's code, not ours.

Delve into the Live Container

I suggested they check the code themselves, but they still had no idea, so they pushed the question back to me several weeks later. I understood that to some extent, because they didn't know what had happened either, and my earlier suggestion was admittedly somewhat ambiguous.

This time, they provided a live container that could be used safely to reproduce the issue. I checked again, using delve directly to debug the service in live. To be honest, that's quite dangerous and usually impossible in a live environment. Luckily, this service is unimportant and its QPS is extremely low, which made debugging possible.

dlv exec ./bin/binary
(dlv) break os.Exit
Breakpoint 1 set at 0x5784c6 for os.Exit() 
  /root/bin/sdk/go1.19.9/src/os/proc.go:62

(dlv) continue 
> os.Exit() /root/bin/sdk/go1.19.9/src/os/proc.go:62 
  (hits goroutine(232):1 total:1) (PC: 0x5784c6)
    73:     if err != nil {
=>  74:         log.Fatalf("fail to connect database, the error is '%v'", err)
    75:     }
(dlv) stack
 0  0x00000000005784c6 in os.Exit
    at  /root/bin/sdk/go1.19.9/src/os/proc.go:62
 1  0x00000000007d86c5 in log.Fatalf
    at /root/bin/sdk/go1.19.9/src/log/log.go:375
 2  0x00000000017474b1 in blablabla/internal/xxx.initXX
    at ./internal/xxx/init.go:74
... ignore many lines

(dlv) frame 2
> os.Exit() /root/bin/sdk/go1.19.9/src/os/proc.go:62 
    (hits goroutine(232):1 total:1) (PC: 0x5784c6)
Frame 2: ./internal/xxx/init.go:74 (PC: 17474b1)
    72:     
    73:     if err != nil {
=>  74:         log.Fatalf("fail to connect database, the error is '%v'", err)
    75:     }

In this way, I learned that the process exited due to the call to log.Fatalf, and could hand the users the precise call stack so they could fix it directly.

However, this wasn't enough, because the question remained: where did the log.Fatalf output go? If the log had printed normally, it wouldn't have disturbed me or cost me all this troubleshooting time.

Where Did the Log Go?

log.Fatalf is very simple: it uses Write to write to an io.Writer, which here is os.Stderr. Looking at the implementation, the error returned by Output is discarded. So the lost log might be caused by a write failure, because stderr itself went wrong; the container platform might manipulate it silently, which is very internal knowledge on their side.

var std = New(os.Stderr, "", LstdFlags)

func (l *Logger) Fatalf(format string, v ...any) {
    l.Output(2, fmt.Sprintf(format, v...))
    os.Exit(1)
}
func (l *Logger) Output(calldepth int, s string) error {
    calldepth++ // +1 for this frame.
    return l.output(0, calldepth, func(b []byte) []byte {
        return append(b, s...)
    })
}

// output can take either a calldepth or a pc to get source line information.
// It uses the pc if it is non-zero.
func (l *Logger) output(pc uintptr, calldepth int,
    appendOutput func([]byte) []byte) error {
    // ignore many lines, added by xieyuschen
    l.outMu.Lock()
    defer l.outMu.Unlock()
    _, err := l.out.Write(*buf)
    return err
}

To demonstrate this in the container, we can write some bytes to stderr and check the result. Now we add a breakpoint in the user's code and try to write to stderr:

(dlv) break internal/xxx/init.go:74
Breakpoint 1 set at 0x17473d0 for 
  blabla/internal/xxx.initXX() ./internal/xxx/init.go:74
(dlv) continue 
> blabla/internal/xxx.initXX() ./internal/xxx/init.go:74 
  (hits goroutine(234):1 total:1) (PC: 0x17473d0)
    73:     if err != nil {
=>  74:         log.Fatalf("fail to connect database, the error is '%v'", err)
    75:     }

(dlv) call os.Stderr.Write([]byte("helloworld"))
> blabla/internal/xxx.initXX() ./internal/xxx/init.go:74 
  (hits goroutine(234):1 total:1) (PC: 0x17473d0)
Values returned:
    n: 0
    err: error(*io/fs.PathError) *{
        Op: "write",
        Path: "/dev/stderr",
        Err: error(syscall.Errno) EFAULT (14),}

Then we found that stderr is broken for writing, which is why log.Fatalf didn't leave any log. After that, I reported this error to the platform team for further investigation.

Some Pitfalls during fmt.Print

  1. call not at safe point: When we break inside log.go and try to call a fmt function, dlv complains that we're not at a safe point. To solve this, make the function call from a breakpoint inside the user's code.
(dlv) call std.Output(2, "helloworld")
> log.Fatalf() /root/bin/sdk/go1.19.9/src/log/log.go:373 (PC: 0x7d8666)
Command failed: call not at safe point
  2. not enough arguments:

I didn't know why this happened, so I tried another way.

(dlv) call fmt.Fprintf(os.Stderr, "Debug: Reached main.go:8\n")
> blabla/internal/xxx.initXX() ./internal/xxx/init.go:74 
  (hits goroutine(234):1 total:1) (PC: 0x17473d0)
Command failed: not enough arguments
  3. can not convert value of type *os.File to io.Writer:

It seems the dlv function call doesn't allow implicit conversions, even though *os.File satisfies the io.Writer interface.

(dlv) call fmt.Fprintf(os.Stderr, "Debug: Reached %s:%d\n", "init.go")
> blabla/internal/xxx.initXX() ./internal/xxx/init.go:74 
  (hits goroutine(234):1 total:1) (PC: 0x17473d0)
Command failed: can not convert value of type *os.File to io.Writer