RE: System logs as reproducible scripts

March 6, 2007 Eric Pugh
Category: Uncategorized

Duane Gran recently pinged me about thoughts on “System logs as reproducible scripts”. Its a topic that Ive been running into recently as weve gone into testing a web based application, and dealing with some of our users very difficult to reproduce bug reports.

The challenge with a log that can be “played back” is that often the system will prevent the same set of events from happening again. For example, if you record a series of events that includes creating a user, then when you play them back the user creation step will fail with a “user already exists” type of error.

However, this isnt to say that this isnt a good idea. With in narrow scopes, this has been done. For example, there are tools that intercept your SQL transactions and record them for use in rebuilding indexes and debugging after the fact issues. Ive also seen a tool that attempted to record all your SQL statements, and then replay them in reverse to restore your database to a known state!

Right now the tools available dont help us follow the path of execution through a web app for a specific user very well. We can see part of the picture, via grepping through a webserver log for a session id, but its difficult to tie that into a log of SQL statements issued for that session. And there arent any good simple tools for connecting data collected from Google Analytics, and matching it up to other data sources like server logs or SQL transaction logs.

On the server side, there are tools like DTrace that provide a developer with indepth visibility into what is happening to an application. DTrace was used by some of the PostgresSQL developers at OSCON 2006 to find many bottlenecks in the performance of PostgresSQL for example. Now we need to bubble up one more level up and be able to view the actions performed by a specific user that may cause activity on multiple tiers and systems!

I used to work with Canoo WebTest, a tool for functional testing of websites. And one thing it did that was wonderful was to record on the local filesystem a copy of every HTML page that it recieved back from the server. This meant that you could go back and “see” what your Canoo Functional Test “saw” at each step. Selenium is my current favorite for Functional testing. But it doesnt record each page as it runs through them, so you have to figure out what tests failed by just the error message that was created when running a suite of tests.

Maybe what we need, at least from a web application perspective, is the ability to save each page displayed to the user? I can envision the webserver sending a copy of each page back to the user, but also saving a copy locally, and aggregating them based on session id, and naming (and thus ordering) them based on time? Ajax calls of course might be just the raw XML that was sent back to the client between full HTML page loads. While this might put inordinate load on a large public website, it might be just the ticket for when you are doing development/testing of a website. A bug report with a timestamp and session id could be tied back to the sequence of webpages, and the developer could visually see each step that was taken to generate the error.

Ideally the list of pages could be tied together in some sort of slideshow that the developer could play back to see what the user did. While this approach isnt the generic “System logs as reproducible scripts” that Duane was looking for, it is a step in the right direction!