Datadog automated instrumentation for .NET works by rewriting the IL of methods of interest to emit traces that are then sent to the back-end. This is a complex piece of logic, written using the profiler API, and riddled with corner cases. And as always with complex code, bugs are bound to happen, and they can be very difficult to diagnose.

As it turns out, we had customer reports of applications throwing InvalidProgramException when using our instrumentation. This exception is thrown when the JIT encounters invalid IL code, most likely emitted by our profiler. The symptoms were always the same: upon starting, the application would sometimes end up in a state where the exception was thrown every time a particular method was called. When that happened, restarting the application fixed the issue. The affected method changed from one occurrence to the next. The issue was bad enough that the customers felt the need to report it, and rare enough that it couldn’t be reproduced at will. Yikes.
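To give an idea of what triggers this exception, here is a minimal standalone sketch (completely unrelated to our profiler, purely illustrative) that uses System.Reflection.Emit to generate deliberately broken IL. Invoking the resulting delegate should throw the same InvalidProgramException, because the JIT rejects the method body:

using System;
using System.Reflection.Emit;

class InvalidIlDemo
{
    static void Main()
    {
        // Build a dynamic method whose IL is invalid: it pops a value
        // from an empty evaluation stack before returning.
        var method = new DynamicMethod("BrokenMethod", typeof(void), Type.EmptyTypes);
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Pop); // stack underflow: there is nothing to pop
        il.Emit(OpCodes.Ret);

        var broken = (Action)method.CreateDelegate(typeof(Action));

        // The body is JIT-compiled when the delegate is first invoked, and the
        // JIT rejects it: "Common Language Runtime detected an invalid program."
        broken(); // throws System.InvalidProgramException
    }
}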

Since we couldn’t reproduce the issue ourselves, I decided to ask for a memory dump, and received one a few weeks later (random issues are random). This was my first time debugging this kind of problem, and it proved to be quite tricky, so I figured it would make a nice subject for an article.

The article ended up being a lot longer than I thought, so I divided it into three parts.

Preliminary exploration

The memory dump had been captured on a Linux instance. I opened it with dotnet-dump, running on WSL2. The first step was to find which method was throwing the exception. Usually, this kind of memory dump is captured when the first-chance exception is thrown, and that exception is visible in the last column of the clrthreads command output. But I couldn’t find any:

> clrthreads

ThreadCount:      19
UnstartedThread:  0
BackgroundThread: 10
PendingThread:    0
DeadThread:       7
Hosted Runtime:   no
                                                                                                            Lock
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1        1 0000000001B318D0  2020020 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn
   5    2        a 0000000001AD1FE0    21220 Preemptive  00007FCE9864C630:00007FCE9864E2D0 0000000001AF60B0 0     Ukn (Finalizer)
   7    3        c 00007FCE88000C50  1020220 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
   9    6       11 00007FCE8C17B250    21220 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn
  10   10       15 0000000002F8FAF0    21220 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn
  11   11       16 0000000002ECF840  2021020 Preemptive  00007FCE98650E48:00007FCE986522D0 0000000001AF60B0 0     Ukn
XXXX   29        0 00007FCE44002BE0  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX   24        0 00007FCE5C006F60  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX   16        0 00007FCE800017C0  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX    4        0 00007FCE80002830  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX   23        0 00007FCE5C020CF0  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX   20        0 00007FCE7403F840  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
XXXX   15        0 00007FCE5803B2F0  1031820 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn (Threadpool Worker)
  12   27      6d6 00007FCE74011A10  1021220 Preemptive  00007FCE98C7F330:00007FCE98C7FA28 0000000001AF60B0 0     Ukn (Threadpool Worker)
  13   21      700 00007FCE740A0030  1021220 Preemptive  00007FCE98D10578:00007FCE98D11168 0000000001AF60B0 0     Ukn (Threadpool Worker)
   8   28       10 00007FCE78000C50    20220 Preemptive  0000000000000000:0000000000000000 0000000001AF60B0 0     Ukn
  14   26      71b 00007FCE3C007610  1021220 Preemptive  00007FCE98E81908:00007FCE98E831C8 0000000001AF60B0 0     Ukn (Threadpool Worker)
  15   25      71d 00007FCE3C008750  1021220 Preemptive  00007FCE98E7BAC8:00007FCE98E7D1C8 0000000001AF60B0 0     Ukn (Threadpool Worker)
  16   17      71f 00007FCE400292B0  1021220 Preemptive  00007FCE98CFB800:00007FCE98CFD168 0000000001AF60B0 0     Ukn (Threadpool Worker)

I then decided to have a look at the notes sent along with the dump (yeah, I know, I should have started there), and understood why I couldn’t see the exception: the customer confirmed that the issue was occurring and simply captured the memory dump at a random point in time. Can’t blame them: I don’t even know how to capture a memory dump on first-chance exceptions on Linux, and it doesn’t seem to be supported by procdump. If somebody from the .NET diagnostics team is reading this…

That’s OK though. If no garbage collection had happened since the exception was thrown, it should still be hanging around somewhere on the heap. To find out, I used the dumpheap -stat command:

> dumpheap -stat -type InvalidProgramException

Statistics:
              MT    Count    TotalSize Class Name
00007fcf109e7428        3          384 System.InvalidProgramException
Total 3 objects

Three of them, great. I used the dumpheap -mt command to get their addresses, using the value from the “MT” column. Then I used the printexception (pe) command to get the stacktrace associated with the exception:

> dumpheap -mt 00007fcf109e7428

         Address               MT     Size
00007fce985a4358 00007fcf109e7428      128
00007fce98a8b008 00007fcf109e7428      128
00007fce98b560f0 00007fcf109e7428      128

Statistics:
              MT    Count    TotalSize Class Name
00007fcf109e7428        3          384 System.InvalidProgramException
Total 3 objects


> pe 00007fce985a4358

Exception object: 00007fce985a4358
Exception type:   System.InvalidProgramException
Message:          Common Language Runtime detected an invalid program.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    00007FCE95127800 00007FCF1636CDC4 Customer.DataAccess.dll!Customer.DataAccess.Generic.AsyncDataAccessBase`2+<>c__DisplayClass1_0[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].<CreateConnectionAsync>b__0(System.Threading.Tasks.Task)+0xd4
    00007FCE95127830 00007FCF1636CC4E System.Private.CoreLib.dll!System.Threading.Tasks.ContinuationResultTaskFromTask`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke()+0x6e
    00007FCE95127870 00007FCF15BA54D1 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x41
    00007FCE951278A0 00007FCF15CA33BC System.Private.CoreLib.dll!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()+0x1c
    00007FCE951278B0 00007FCF15BA51CE System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)+0x17e
    00007FCE95127680 00007FCF15CA33BC System.Private.CoreLib.dll!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()+0x1c
    00007FCE95127690 00007FCF15BA4B3B System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)+0x4b
    00007FCE951276B0 00007FCF15CB4FF3 System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter`1[[System.__Canon, System.Private.CoreLib]].GetResult()+0x13
    00007FCE951276D0 00007FCF16313EA5 Customer.DataAccess.dll!Customer.DataAccess.Generic.AsyncDataAccessBase`2+<CreateConnectionAsync>d__1[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].MoveNext()+0x245
[... removed for brevity]
    00007FCE95112A00 00007FCF15BBB442 Microsoft.AspNetCore.Server.Kestrel.Core.dll!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequests>d__214`1[[System.__Canon, System.Private.CoreLib]].MoveNext()+0x472

StackTraceString: <none>
HResult: 8013153a

Note: since this is the output from an actual memory dump sent by a customer, I redacted the namespaces containing business code and replaced them with “Customer”.

We see that the exception was thrown from Customer.DataAccess.Generic.AsyncDataAccessBase`2+<>c__DisplayClass1_0.<CreateConnectionAsync>b__0. The <>c__DisplayClass part indicates a compiler-generated closure, so this was probably a lambda declared in the CreateConnectionAsync method.
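For readers unfamiliar with that naming scheme, here is a minimal sketch (purely illustrative, not the customer’s code) showing where such names come from: when a lambda captures a local variable, the compiler hoists the captured state into a generated closure class and turns the lambda body into a method on that class:

using System;
using System.Threading.Tasks;

public class Example
{
    // Hypothetical method, for illustration only.
    public async Task CreateConnectionAsync()
    {
        var attempt = 1;

        // Because the lambda captures 'attempt', the compiler hoists it into a
        // generated closure class (named along the lines of <>c__DisplayClass1_0)
        // and compiles the lambda body into a method named <CreateConnectionAsync>b__0.
        await Task.Delay(10).ContinueWith(task =>
        {
            Console.WriteLine($"Attempt {attempt} completed: {task.Status}");
        });
    }
}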

Since I didn’t have the source code, I used the ip2md command to convert the instruction pointer (second column of the stacktrace) to a MethodDesc, then fed it to the dumpil command to print the raw IL:

> ip2md 00007FCF1636CDC4

MethodDesc:   00007fcf161d3c78
Method Name:          Customer.DataAccess.Generic.AsyncDataAccessBase`2+<>c__DisplayClass1_0[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].<CreateConnectionAsync>b__0(System.Threading.Tasks.Task)
Class:                00007fcf161c74d0
MethodTable:          00007fcf161d3ca0
mdToken:              000000000600009E
Module:               00007fcf1457e730
IsJitted:             yes
Current CodeAddr:     00007fcf1636ccf0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            0000000000000000
     CodeAddr:           00007fcf1636ccf0  (OptimizedTier1)
     NativeCodeVersion:  00007FCE5806D280
     CodeAddr:           00007fcf16276f20  (QuickJitted)
     NativeCodeVersion:  0000000000000000
     
     
> dumpil 00007fcf161d3c78

ilAddr is 00007FCF0C04C3AC pImport is 000000000312B860
ilAddr = 00007FCF0C04C3AC
IL_0000: ldarg.1
IL_0001: callvirt class [netstandard]System.AggregateException System.Threading.Tasks.Task::get_Exception()
IL_0006: brtrue.s IL_0036
IL_0008: ldarg.0
[...]

I’m not going to show the full IL for obvious privacy reasons, but one thing struck me: the code was not calling any method that we instrument, so there was no reason for the IL to have been rewritten (and indeed, there were no traces of rewriting).

Digging deeper, I got a hint at what was happening:

IL_0036: ldarg.1
IL_0037: callvirt class [netstandard]System.AggregateException System.Threading.Tasks.Task::get_Exception()
IL_003c: callvirt class [netstandard]System.Collections.ObjectModel.ReadOnlyCollection`1<class [netstandard]System.Exception> System.AggregateException::get_InnerExceptions()
IL_0041: callvirt int32 ::get_Count()
IL_0046: ldc.i4.1
IL_0047: bne.un.s IL_005b
IL_0049: ldarg.1
IL_004a: callvirt class [netstandard]System.AggregateException System.Threading.Tasks.Task::get_Exception()
IL_004f: callvirt class [netstandard]System.Collections.ObjectModel.ReadOnlyCollection`1<class [netstandard]System.Exception> System.AggregateException::get_InnerExceptions()
IL_0054: ldc.i4.0
IL_0055: callvirt !0 ::get_Item(int32)
IL_005a: throw
IL_005b: ldarg.1
IL_005c: callvirt class [netstandard]System.AggregateException System.Threading.Tasks.Task::get_Exception()
IL_0061: throw

Translated into C#, this is pretty much equivalent to:

if (task.Exception.InnerExceptions.Count == 1)
{
	throw task.Exception.InnerExceptions[0];
}
else 
{
	throw task.Exception;
}

Basically, the method was there to log a few things (not included in the snippet) and then rethrow the exception. But since the exception wasn’t wrapped before being rethrown (and ExceptionDispatchInfo wasn’t used), the original callstack was overwritten!

The caller code would have been something like:

await SomethingAsync().ContinueWith(task => 
{
	// Logging

	// Rethrow
	if (task.Exception.InnerExceptions.Count == 1)
	{
		throw task.Exception.InnerExceptions[0];
	}
	else 
	{
		throw task.Exception;
	}
});

The usage of ContinueWith is confirmed by the presence of ContinuationResultTaskFromTask`1 in the callstack.

That was really bad for me, because it meant that the original exception could have been thrown anywhere in whatever methods were called by SomethingAsync.
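As an aside, the original stacktrace could have been preserved by rethrowing through ExceptionDispatchInfo, along these lines (a sketch, not the customer’s actual code):

using System.Runtime.ExceptionServices;

await SomethingAsync().ContinueWith(task =>
{
	// Logging

	// Rethrow while preserving the original stacktrace
	if (task.Exception.InnerExceptions.Count == 1)
	{
		ExceptionDispatchInfo.Capture(task.Exception.InnerExceptions[0]).Throw();
	}
	else
	{
		ExceptionDispatchInfo.Capture(task.Exception).Throw();
	}
});

(Simply awaiting the task inside a try/catch would have achieved the same result, since await rethrows through ExceptionDispatchInfo under the hood.)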

By looking at the raw IL of the caller method, I figured out that SomethingAsync was System.Data.Common.DbConnection::OpenAsync. Since the application used PostgreSQL with the Npgsql library, and since our tracer automatically instruments that library, it made sense that the rewritten method would be somewhere in there. Normally, I could have checked our logs to quickly find which Npgsql method was rewritten, but the customer didn’t retrieve them before killing the instance, and waiting for the problem to happen again could have taken weeks (random issues are random). So I decided to bite the bullet and start the painstaking process of cross-checking the source code of Npgsql against the state of the objects in the memory dump, to find the exact place where the execution stopped and the exception was originally thrown.

For instance, at some point the method PostgresDatabaseInfoFactory.Load is called:

        public async Task<NpgsqlDatabaseInfo?> Load(NpgsqlConnection conn, NpgsqlTimeout timeout, bool async)
        {
            var db = new PostgresDatabaseInfo(conn);
            await db.LoadPostgresInfo(conn, timeout, async);
            Debug.Assert(db.LongVersion != null);
            return db;
        }

There were instances of PostgresDatabaseInfo on the heap, so I knew this method ran properly. From there, the LoadBackendTypes method is called, and the result is assigned to the _types field:

        internal async Task LoadPostgresInfo(NpgsqlConnection conn, NpgsqlTimeout timeout, bool async)
        {
            HasIntegerDateTimes =
                conn.PostgresParameters.TryGetValue("integer_datetimes", out var intDateTimes) &&
                intDateTimes == "on";

            IsRedshift = conn.Settings.ServerCompatibilityMode == ServerCompatibilityMode.Redshift;
            _types = await LoadBackendTypes(conn, timeout, async);
        }

But when inspecting the instances of PostgresDatabaseInfo (using the dumpobj or do command with the address returned by the dumpheap -mt command), we can see that the _types field has no value:

> dumpheap -stat -type PostgresDatabaseInfo

Statistics:
              MT    Count    TotalSize Class Name
00007fcf162568f0        1           24 Npgsql.PostgresDatabaseInfoFactory
00007fcf16256b10        2          256 Npgsql.PostgresDatabaseInfo
Total 3 objects


> dumpheap -mt 00007fcf16256b10

         Address               MT     Size
00007fce98a8adb0 00007fcf16256b10      128
00007fce98b55e98 00007fcf16256b10      128

Statistics:
              MT    Count    TotalSize Class Name
00007fcf16256b10        2          256 Npgsql.PostgresDatabaseInfo
Total 2 objects


> do 00007fce98a8adb0

Name:        Npgsql.PostgresDatabaseInfo
MethodTable: 00007fcf16256b10
EEClass:     00007fcf16245848
Size:        128(0x80) bytes
File:        /Npgsql.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007fcf0f640f90  40001e5        8        System.String  0 instance 00007fce984d56f0 <Host>k__BackingField
00007fcf0f63a0e8  40001e6       68         System.Int32  1 instance             5432 <Port>k__BackingField
00007fcf0f640f90  40001e7       10        System.String  0 instance 00007fce984d5750 <Name>k__BackingField
00007fcf0f75e410  40001e8       18       System.Version  0 instance 00007fce98a8ae30 <Version>k__BackingField
00007fcf0f6360c8  40001e9       6c       System.Boolean  1 instance                1 <HasIntegerDateTimes>k__BackingField
00007fcf0f6360c8  40001ea       6d       System.Boolean  1 instance                1 <SupportsTransactions>k__BackingField
00007fcf16546df8  40001eb       20 ...aseType, Npgsql]]  0 instance 00007fce98a8ae70 _baseTypesMutable
00007fcf16547328  40001ec       28 ...rayType, Npgsql]]  0 instance 00007fce98a8ae90 _arrayTypesMutable
00007fcf16547858  40001ed       30 ...ngeType, Npgsql]]  0 instance 00007fce98a8aeb0 _rangeTypesMutable
00007fcf16547d88  40001ee       38 ...numType, Npgsql]]  0 instance 00007fce98a8aed0 _enumTypesMutable
00007fcf165482b8  40001ef       40 ...iteType, Npgsql]]  0 instance 00007fce98a8aef0 _compositeTypesMutable
00007fcf165487e8  40001f0       48 ...ainType, Npgsql]]  0 instance 00007fce98a8af10 _domainTypesMutable
00007fcf16548d18  40001f1       50 ...resType, Npgsql]]  0 instance 00007fce98a8af30 <ByOID>k__BackingField
00007fcf16549378  40001f2       58 ...resType, Npgsql]]  0 instance 00007fce98a8af78 <ByFullName>k__BackingField
00007fcf16549378  40001f3       60 ...resType, Npgsql]]  0 instance 00007fce98a8afc0 <ByName>k__BackingField
00007fcf162191f8  40001e3      1b0 ...aseInfo, Npgsql]]  0   static 00007fce984ecc48 Cache
00007fcf16545ed0  40001e4      1b8 ...Factory, Npgsql]]  0   static 00007fce984ece00 Factories
00007fcf16509488  400028e       70 ...resType, Npgsql]]  0 instance 0000000000000000 _types
00007fcf0f6360c8  400028f       6e       System.Boolean  1 instance                0 <IsRedshift>k__BackingField
00007fcf161d71c0  400028d      220 ...ging.NpgsqlLogger  0   static 0000000000000000 Log

Therefore, the execution stopped somewhere in the LoadBackendTypes method. That method has calls to NpgsqlCommand.ExecuteReader and NpgsqlCommand.ExecuteReaderAsync, two methods that our tracer instruments, and was therefore a candidate for rewriting.

Great! At this point, I just had to dump the generated IL, find the error, and call it a day. Right? Well, it got more complicated than I anticipated, as we’ll see in the next article.