Tuning log verbosity and mining information from logs

@price gave me some feedback on Slack#dm-jointcal about how much information is currently output at the INFO level in jointcal. @jbosch and @timj both commented that there has been ongoing SQuaRE work to explore more sophisticated log analysis tools (e.g. logstash); we currently have no guidance on how to extract information from logs, nor on how much information to put at various levels. This results in a tension between those of us who follow the Unix “only print warnings and errors by default” approach versus “print information that could be useful to later error reconstruction or allow others to explore properties of a run without re-running it themselves”. That latter approach requires some judgement about how useful a given output is compared with its frequency.

If we had more intelligent logging tools, we could let our logs be more verbose without things getting lost in the “clutter”; what is clutter and what not is context dependent. This has resulted in some back-and-forth on various tickets with different people claiming a given output should be somewhere between DEBUG and WARN, depending on their particular use case of that code and where they land on the “verbosity spectrum”.

Do we have any writeups about those log analysis explorations, or any further plans for how to manage logs? This could help people try out different approaches to log analysis and give us a sense of just how verbose is “too verbose”.

Related tickets

This is not directly related to the accepted-but-not-implemented RFC-245: that would add an additional log level but we would still have to make judgement calls about whether certain information goes into DEBUG vs. VERBOSE vs. INFO.

This is somewhat related to DM-25942, in that we don’t really have guidance on how to log different types of error conditions. The dev guide has a sentence per log level but that’s open to a lot of interpretation.

There was a similar discussion on Community two years ago (Warnings or other not-errors in the Stack) which resulted in some discussion about logging policy. In that discussion, @ktl said

I’d say the current de facto policy is to use log.warn and then rely on the user noticing, given that warning-level log messages are supposed to be prominent in the output, or rely on a workflow system explicitly scanning the log for a warning if that were to be needed.

Warnings are regularly not-noticed (and probably for many cases, they may not be real problems, but the software can’t always know that a-priori), and we have no workflow to scan logs.

My feeling is that default logging should be optimized for human interactive execution, with messages only indicating progress (at INFO level) or problems (at WARN level, but also including diagnostics that might indicate problems to a human but not necessarily to the algorithm at INFO level).

Another level of logging, perhaps selected by a custom log-related flag option, would be appropriate for automated execution. It might include additional information useful for forensic analysis and performance monitoring. These are the logs that would go to a log collection system that could be used for analysis and alerting. These do not necessarily need to be at DEBUG level; they could also be at INFO level but off by default. (For example, the loggers for the additional messages could reside in their own logger hierarchy.)

(Aside: The first implementation of a log collection, analysis, and alerting system for pipelines seems likely to come from the IDF at this point. I believe all would like to see such a system at NCSA as well, but the work to set this up needs to be assigned and prioritized.)

I feel that debugging is best handled by selecting modification of the thresholds of particular loggers of interest rather than shotgun threshold changes across the board. I believe sufficient capability is available in the CmdLineTask/PipelineTask argument parsing to enable this. To my mind, this obviates the need for a distinction between DEBUG and VERBOSE.

The above comments suggest increased use of the logger name hierarchy, which I feel is generally underused and unappreciated at present. Use of logger names would be substantially aided by a documentation tool that could automatically extract the available names, either statically or dynamically. It is possible that existing logger names and the hierarchy may need to be reorganized as well.

The comments I made in the quoted Community post were primarily intended to summarize current (at that time) de facto practice, not be prescriptive as to what should be done. I was and still am hoping for additional input as to what developers and users want.

2 Likes

For the record (I’ve already mentioned this in some of the posts @parejkoj linked), our logging framework is ultimately based on Log4J, which specifically prescribes that INFO should be reserved for progress messages, with diagnostic information restricted to DEBUG or TRACE.

I support greater visibility of multiple loggers, since it’s effectively impossible to run an entire pipeline at DEBUG level. However, this may require some changes to how loggers are handled: for example, if I enable DEBUG logging in a particular subtask, should I also expect DEBUG output from any meas_algorithms functions called by that task? Should we require that all non-task code accept an external logger as an optional argument? (I’d prefer the answer be “no”, as it might exacerbate the context-dependence problem.)

Running our own log collection and search infrastructure has proven prohibitively expensive in terms of staff time, still without reaching the level of stability and reliability that we would like. For the IDF, we will be exploring using Google’s Stackdriver logging platform instead.

The two bits of advice I have from personal experience:

  1. Log everything in JSON (or some other standard structured format but for line-based logging, JSON is usually best) if you can. You’ll never regret adding structure to logs if you use any query or analysis software.
  2. Don’t log detailed trace information by default, but have some easy way to turn it on, ideally without having to tear down whatever you’re trying to investigate. Logging it by default will often make it harder to debug problems, not easier, because the critical error message will be buried in thousands of lines of meaningless output.