Wednesday, July 19, 2023

Discoverable code documentation

As obvious as it seems, it matters where code documentation appears, and how discoverable it is. If the documentation exists, but nobody knows about it, it isn't very useful.

What are some ways to make documentation more discoverable and more useful? Do we need SEO for code documentation? Different types of documentation may be needed for different use cases.

Before we go into boosting discovery, what can go wrong with missing or incorrect documentation? One could face a bug from misunderstanding a library or language feature, then spend hours reverse engineering and debugging. Sometimes a developer might follow up and fill in the gap on Stack Overflow, or worse, in bug trackers (from an old post).

Boosting method 1: put it in the middle of a workflow

Example 1 -- good compiler error messages. Compilers are part of most coding workflows. You just hit a problem. If the compiler error message is good, it will tell you what is wrong. Maybe it even tells you a likely fix. Maybe it gives you a link to a webpage with more details for the "why?". This helps when there is already other documentation where you learned about something in the first place, but you forgot some of the concrete details. The initial knowledge gets you close enough to write something almost right, and the compiler reminds you how to make it right.
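
As a small illustration, here is a hypothetical snippet that intentionally does not compile (the exact diagnostic wording varies by compiler and version):

    #include <vector>

    int main() {
      std::vector<int> values = {1, 2, 3};
      // Typo: push_bakc instead of push_back. Recent clang and gcc releases
      // point at this call and typically suggest the correct spelling
      // ("did you mean 'push_back'?"), sometimes as a fix-it hint that an IDE
      // can apply automatically.
      values.push_bakc(4);
      return 0;
    }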

IDEs might go a step further and apply the fix for you. This could lead to slightly twisted workflows. You might write code that is intentionally wrong but takes fewer keystrokes than the right thing, knowing that you can lean on the IDE to fix it up for you. For example, maybe the return type of a method is pretty verbose, so you might just type "sessionState = getSessionState();" and let the IDE fill in the declaration. It's sort of like a backward autocomplete.
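
A rough sketch of that quick-fix workflow (the names and the verbose return type here are hypothetical):

    // Hypothetical API with a verbose return type, just for illustration.
    namespace app::auth {
    struct SessionStateHandle { int id = 0; };
    SessionStateHandle getSessionState() { return {}; }
    }  // namespace app::auth

    void example() {
      // What you might actually type, knowing it won't compile yet:
      //   sessionState = app::auth::getSessionState();
      // ...then trigger the IDE quick-fix ("create local variable"), which
      // expands it into the fully spelled-out declaration:
      app::auth::SessionStateHandle sessionState = app::auth::getSessionState();
      (void)sessionState;  // silence unused-variable warnings
    }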


You can also inject information into "compiler error messages", e.g., by attaching a message to a static_assert, or at runtime to an assert.
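
For instance, a minimal sketch (ShmBuffer, divide, and the doc path are made up for illustration):

    #include <cassert>
    #include <type_traits>

    // The static_assert message acts as inline documentation that shows up
    // right in the compiler error when someone violates the constraint.
    template <typename T>
    struct ShmBuffer {
      static_assert(std::is_trivially_copyable_v<T>,
                    "ShmBuffer<T> requires a trivially copyable T because the "
                    "buffer is shared across processes; see docs/shm_buffer.md");
      T value;
    };

    // At runtime, the classic (condition && "message") trick makes the reason
    // show up in the assertion failure output.
    int divide(int a, int b) {
      assert(b != 0 && "divide(): b must be nonzero; callers should check first");
      return a / b;
    }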

Example 2 -- diagnoses from scanner tools. You hit a problem. Let's say it's a performance problem where jobs start to time out. In your workflow it is common to reach for a job status page. Maybe the job monitoring tools can scan the way you've set up the jobs, or use profiling information, and tell you about common tweaks to improve performance. The common tweaks are documented somewhere, but maybe it is hard to remember them every time, or the documentation was TL;DR or not discoverable. Again, this requires some knowledge to get started, but the pointers help polish. Another example could be static analyses that report problems and remind you of a documented sharp corner or footgun.
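
In C++, attributes are one way to turn a documented footgun into a diagnostic that appears in the normal build or static analysis workflow (a sketch with hypothetical names and doc paths):

    struct Result { int code = 0; };

    // C++20 lets [[nodiscard]] carry a reason, so the "why" travels with the
    // warning whenever a caller ignores the return value.
    [[nodiscard("ignoring the Result loses the error code; check code != 0")]]
    Result DoWork() { return {}; }

    // Deprecation messages can point at the replacement and the longer docs.
    [[deprecated("use DoWork() instead; see docs/migration.md")]]
    Result DoWorkLegacy() { return {}; }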

Downsides

  • not for learning in the first place
  • not for every use case (e.g., assumes common tools for workflows)
  • documentation ends up scattered, e.g., within string literals in the source code of tools
    • might be hard to internationalize/translate, since the docs are not organized in a way that lets translation teams work through them and know when an update also needs translation

Boosting method 2 -- standard location

Example 1 -- javadoc, docstrings, and other forms of method comments are nicely discoverable because, by convention, you know where the documentation should be. These are great for learning in the first place. In some cases, there is also a convention for where to find test cases or example code that you can learn from.

It might not be nicely ordered for learning like a book or lessons/tutorials, but it could be good for learning small bits like a single API that is independent of other APIs.
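
For example (Split is a hypothetical function; the point is the conventional location of the comment, right above the declaration, where readers and tools know to look):

    #include <string>
    #include <vector>

    // Splits `text` on `delimiter` and returns the pieces in order.
    // Empty pieces are kept, so Split("a,,b", ',') yields {"a", "", "b"}.
    // The delimiter itself never appears in the output.
    std::vector<std::string> Split(const std::string& text, char delimiter) {
      std::vector<std::string> parts;
      std::string current;
      for (char c : text) {
        if (c == delimiter) {
          parts.push_back(current);
          current.clear();
        } else {
          current.push_back(c);
        }
      }
      parts.push_back(current);
      return parts;
    }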

IDEs can also take advantage of this convention to gather up "quick docs":


One weird issue I've found is that once you have some indirection or aliases, it's not clear which alias has the right doc comment. E.g., absl's flat_hash_set hides most of its implementation in an internal base class, raw_hash_set. The public documentation is where you expect: in the public flat_hash_set header file. However, the method bodies aren't there. So navigating through a code search web app or an IDE will probably lead you to the internal raw_hash_set header file. There is documentation there too, but it is not useful ("this overload kicks in when ...").
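
A sketch of the shape of that problem, with hypothetical names that mirror the flat_hash_set / raw_hash_set split:

    namespace internal {
    template <typename T>
    class RawSet {
     public:
      // Implementation-focused comment ("this overload kicks in when ...").
      bool insert(const T&) { /* the real logic lives here */ return true; }
    };
    }  // namespace internal

    // The user-facing documentation is attached to the public wrapper...
    template <typename T>
    class FlatSet : public internal::RawSet<T> {
     public:
      // Inserts the value if it is not already present; returns true if inserted.
      using internal::RawSet<T>::insert;
    };
    // ...but "go to definition" on insert() lands in internal::RawSet, next to
    // the less useful comment.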

Example 2 -- word of mouth for good references. E.g., word of mouth might indicate that Wikipedia is a high quality source, or that cppreference.com is a great place to learn APIs in the C++ standard library, or to learn parts of the language with a nice index and search (vs reading through the ISO C++ standard). Or Stack Overflow might have a good reputation for finding answers to common problems.

Downsides

  • reference-form documentation may be good for learning one independent unit at a time, but it assumes some background. It won't have a lesson plan to help you learn a larger topic from scratch.
  • can get out of date if you update the code but forget to update the docs

Boosting method 3 -- rank highly in search

Search engines that work well might rank the kinds of references from boosting method 2, example 2 highly, instead of relying on direct word of mouth. They might also surface hits from places you don't expect to have good information, like bug trackers.

Boosting method 4 -- advertise in small chunks

E.g., break useful information into small 1-pagers.

  • send out a weekly newsletter or a "tip of the week"
  • or share with peers in code reviews and other conversations
  • or advertise in social media

Or break things out into X-minute chunks that can fit into a tech talk or podcast.

Or have a reading group for larger documents (though that doesn't scale as well).

LLMs

 It's 2023. LLMs are hot. How can they help here?

Already done: chat interface where you ask the LLM and the LLM gives you a summary or generates code and maybe explains each step.

Otherwise (and mostly if the above fails and you need to learn more yourself):

  1. downsides from many of the boosting methods are that they are:
    1. not good for first-time learning, but more for small units of learning or reminders
    2. scattered
    3. So: can an LLM digest all the scattered documentation and synthesize a better organized tutorial, lessons, or a book?
    4. Or generate a system diagram or some more visual "summary"?
  2. a downside of javadocs and other docs embedded in workflows/code is that they aren't great for translations. Can an LLM help translate?
  3. detect when documentation is incorrect or out of date after changes to source code?
  4. generate missing documentation from code (parameter x must be sorted, or other important invariants)?
  5. generate example code that is good for further learning?
  6. synthesize the highlights of a step by step evaluation in the form of slides, vs what a debugger would show, which may be more verbose?
  7. convert tech talk videos to text summaries

Not great? Item 4 doesn't seem like a good thing to do, even if it were possible. Writing comments can help you better understand whether your written code makes sense, whether an API is "good", or whether it has too many sharp corners. It's sort of like rubber duck debugging, but it makes you review the code and design to avoid shipping bad APIs.
