Archaeological and Forensic Investigation
I have found it useful to define two terms for investigating the behavior of code: forensic investigation and archaeological investigation.
Forensic investigation involves asking questions about the current state of things. This means using:
- Debugger (GDB, pdb, MATLAB debugger, etc.)
- Log statements
- LSP code navigation (go-to-definition and find-all-references)
- binutils
- Pretty-prints (dumping LLVM or other IR)
- Pure refactoring for clarity (“scratch refactoring” as Michael Feathers describes it in Working Effectively with Legacy Code)
- Adding unit tests for the current, unchanged behavior
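As an illustration of the last item, here is a minimal sketch of a test that pins down current behavior. The function `legacy_discount` is a hypothetical stand-in for whatever code is under investigation, and the expected values are assumed to come from running the code as it is today, not from a specification.

```python
def legacy_discount(total_cents: int, is_member: bool) -> int:
    """Hypothetical legacy function whose behavior we want to pin down."""
    if is_member and total_cents > 10_000:
        return total_cents - total_cents // 10  # members: 10% off
    if total_cents > 10_000:
        return total_cents - 500  # everyone else: flat 500 off
    return total_cents


# Each test records what the code does today, whether or not it was intended.
def test_member_over_threshold():
    assert legacy_discount(20_000, True) == 18_000


def test_non_member_over_threshold():
    assert legacy_discount(20_000, False) == 19_500


def test_exactly_at_threshold_gets_no_discount():
    # Possibly surprising, but it is the current behavior, so pin it.
    assert legacy_discount(10_000, True) == 10_000
```

Tests like these do not say the behavior is correct. They only make it visible when a later change alters it.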
Archaeological investigation, on the other hand, means approaching the problem through the history of the codebase. Useful tools for archaeological investigation include:
- “Blame” through version control history
- Bisecting a bug through old commits to see which commit introduced it
- Searching internal documentation, commit messages, etc. for historical reasons code is structured a certain way
- Diffing public documentation between versions
- Diffing the output of log statements between versions
- Diffing pretty-prints (LLVM or other IR) between versions
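Bisecting is the most mechanical of these tools, so it is worth sketching. In practice `git bisect` does this for you; the code below only illustrates the underlying binary search. The commit names and the `is_bad` check are hypothetical, and the sketch assumes that the oldest commit is known good, the newest is known bad, and the bug persists once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history in which the bug was introduced in "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
broken_since = history.index("d4")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_since)
print(culprit)
```

Each step halves the range of suspect commits, so even a long history needs only a handful of builds, provided each old commit can still be built and tested.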
These two terms have given us a common language for discussing bug investigation at work. When bugs show up in CI, archaeological investigation is often the first thing I reach for. If many changes are bundled together, a bug can be hard to debug without more information about how it was introduced. So when a test starts failing, there are some easy questions to ask to narrow down the culprit:
- Did this test file change in this job?
- Which source files upstream of this test changed in this job?
- Did the test run in the last job? When is the last time it ran and passed?
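These questions are simple enough to answer mechanically. The following is a rough sketch, assuming you can get three things out of your CI system: the set of files changed in the job, the set of source files upstream of the test, and a record of past results for the test. All names, paths, and job numbers here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class TriageReport:
    test_file_changed: bool
    changed_upstream: List[str]
    ran_in_previous_job: bool
    last_passing_job: Optional[int]


def triage(test_file: str,
           changed_files: Set[str],
           upstream_of_test: Set[str],
           history: Dict[int, str],
           current_job: int) -> TriageReport:
    """Answer the three questions for one failing test.

    history maps job number -> "passed" or "failed", and contains only
    the jobs in which the test actually ran.
    """
    earlier_passes = [job for job, status in history.items()
                      if job < current_job and status == "passed"]
    return TriageReport(
        test_file_changed=test_file in changed_files,
        changed_upstream=sorted(changed_files & upstream_of_test),
        ran_in_previous_job=(current_job - 1) in history,
        last_passing_job=max(earlier_passes, default=None),
    )


report = triage(
    test_file="tests/test_parser.py",
    changed_files={"src/lexer.py", "docs/README.md"},
    upstream_of_test={"src/lexer.py", "src/parser.py"},
    history={101: "passed", 102: "passed", 104: "failed"},
    current_job=104,
)
print(report)
```

In this made-up example the test file itself did not change, one upstream source file did, and the test did not run in job 103, so the suspect range is everything after job 102.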
Answers to these questions are helpful: they are not guaranteed to give you the root cause, but they might point you in the right direction. Archaeological investigation can let you “cheat” so you do not have to understand the entire system to get to a root cause.
Archaeological investigation is also helpful if you need to know, for transparency or compliance purposes, which versions of the software are exposed to a particular bug. That is, of course, more relevant if multiple versions must be supported for customers at the same time.
Sometimes archaeological investigation is insufficient:
- A bug may have always existed, in which case archaeological investigation will teach you nothing about the reason for the behavior in question.
- Diffs might be too large and complicated to read through and understand.
- For very old versions of software, source code might be gone or difficult to build with existing infrastructure.
- CI artifacts may not be preserved indefinitely.
When archaeological investigation fails, forensic investigation can yield a much more complete picture. You can’t really understand the system without forensic investigation, which makes you much more familiar with its actual behavior (intended behavior notwithstanding).