Following an OS upgrade, the site started showing spikes of the OpenSSL's "shutdown while in init error" coming from Trilogy, the MySQL adapter. Debugging was difficult because it happened in a C library and Ruby stack traces don't go that deep. I patched OpenSSL to append a stack trace from C land to the error message (see my proof of concept at pudiva/openssl-backtrace), which bubbled up to our logs and solved the mystery. The error actually originated in libcurl, but was caught in Trilogy because OpenSSL has a global error queue and errors weren't cleared between users. We filed patches to libcurl and Trilogy to clear errors, and were confident that the OS upgrade was safe.
After upgrading the site's OS, we started seeing eventual spikes of the OpenSSL error "shutdown while in init" coming from Trilogy, the MySQL adapter. We worried that it might cause an availability incident or indicate a security problem.
Reverting the OS upgrade did make the error go away, but since the upgrade was a security priority for other reasons and users didn't seem to be impacted, we decided to accept it while we investigated.
A colleague downgraded OpenSSL to the same version in the known-good OS, but it didn't work, indicating that the problem was in its usage somewhere else. Another colleague patched Trilogy not to call SSL_shutdown() if the connection hadn't been fully established, and another one tried many things to fix and debug the error, but nothing seemed to work. Meanwhile, I tried to reproduce the error by quickly writing a minimal mock MySQL server and Trilogy client in C, but it never triggered the error... and then as I failed to reproduce it, I had another idea.
Our difficulty with debugging this error is that it originated in OpenSSL, which is written in C, so our Ruby stack traces didn't go deep enough to show who was calling SSL_shutdown() inappropriately... but what if we had a stack trace from C land?
While the C standard doesn't specify a way to output stack traces, the GNU C library does through the backtrace() function. I opened the OpenSSL sauce code and started from the definition of the SSL_shutdown() function. I read some code and docs and figured out that errors were built in the ERR_put_error() function and that I could use ERR_add_error_data() to append arbitrary information to them. Then I came up with the following code (see my working proof of concept at pudiva/openssl-backtrace):
#include <execinfo.h>
void add_backtrace()
{
char buf[4096] = "";
int buf_len = 0;
/* this macro should prevent segfaults - tested with buf[100] */
#define BUF_PRINTF(...) \
do { \
if (buf_len < sizeof (buf) - 1) \
{ \
int n = snprintf(buf + buf_len, sizeof (buf) - buf_len, __VA_ARGS__); \
if (n > 0) \
buf_len += n; \
} \
} while (0)
/* backtrace stuff */
void* bt_array[1024];
int bt_size;
char** bt_strings;
int i;
bt_size = backtrace(bt_array, (sizeof (bt_array) / sizeof (bt_array[0])));
bt_strings = backtrace_symbols(bt_array, bt_size);
BUF_PRINTF(" Obtained %d stack frames:\n", bt_size);
if (bt_strings != NULL)
for (i = 0; i < bt_size; i++)
BUF_PRINTF("%s\n", bt_strings[i]);
else
BUF_PRINTF("strings == NULL!\n");
BUF_PRINTF("\n");
ERR_add_error_data(1, buf);
free(bt_strings);
#undef BUF_PRINTF
}
I appended it to ERR_put_error(), deployed to production and waited. A while later, we got a stack trace like this:
...SSL_shutdown:shutdown while in init:../ssl/ssl_lib.c:2092: Obtained 195 stack frames:
/lib/x86_64-linux-gnu/libcrypto.so.1.1(+0x1a6378) [0x7f6ec3b8e378]
/lib/x86_64-linux-gnu/libcrypto.so.1.1(ERR_put_error+0x1bb) [0x7f6ec3b8ee3d]
/lib/x86_64-linux-gnu/libssl.so.1.1(SSL_shutdown+0xf7) [0x7f6ec3d978bd]
/lib/x86_64-linux-gnu/libcurl.so.4(+0x66b25) [0x7f6eb3753b25]
...
Finally the stack trace revealed that it was libcurl who was generating "shutdown while in init" errors, not Trilogy! What happened is that OpenSSL has a global error queue and the error was not being cleared between libcurl and Trilogy, so it ended up being reported by the latter, terribly misleading us. We filed patches to libcurl and Trilogy, the error was gone and we were confident that the OS upgrade was safe.
Because of reasons, I had to insert the OpenSSL patch as a string literal in a Dockerfile. I tried to use Heredoc syntax, but, for my delight, our version of Docker didn't support it and my patch ended up silently truncated to nothing. My beautiful workaround was to rewrite
cat > openssl-backtrace.patch <<'EOF'
as a series of
printf '%s\n' 'line1' >> openssl-backtrace.patch
My stack trace trickery did solve the mystery and was generally well tolerated, but we later found that it affected one endpoint. Another colleague pointed out that Ruby's OpenSSL bindings sometimes like to "try stuff until it works", but my code made the error handling much more expensive by including stack traces, causing the endpoint to time out in some cases. We had kept my patch online for a bit just in case another OpenSSL error would show up, so we immediately reverted it.
In retrospect, I see that this and potential other issues could've been prevented by restricting the stack trace generation to just the error we were interested in: "shutdown while in init".