SQR-013: LSST DocHub Design

jsick · November 30, 2016, 10:18pm

One of the most consistent pieces of feedback I’ve received is that DM documentation, and information in general, is hard to find. Information about DM lives in technotes, user guides, GitHub repositories, DocuShare, Confluence, the Community forum, JIRA, Zenodo, and ADS, among other places. For the past few months, we’ve mentioned DocHub as a solution to discovering LSST/DM information across disparate platforms, including in the proposed LDM-493: Data Management Documentation Architecture.

You can think of DocHub as a single website from which you can find all tracked LSST documentation and information artifacts.

Thanks to a DESC Hack Week a couple weeks ago, I had an opportunity to explore the DocHub concept. You can read about the resulting design in SQR-013: LSST DocHub Design. Here’s the abstract:

LSST DocHub is a proposed solution to information discovery for LSST Data Management, and the LSST project in general. LSST documentation and information artifacts are published through a variety of platforms by virtue of the way information is created — from documents archived on DocuShare, to source code on GitHub, to conversations on community.lsst.org. Currently, staff and users must go to each platform to find information. This has an overall effect of slowing, and even preventing, knowledge sharing.

LSST DocHub can solve this problem by decoupling information publication from information discovery. DocHub consists of a unified web front-end for documentation browsing, filtering, and search. The front-end is fed by a web API to centralized metadata and full-text databases. These databases are populated by adapters that monitor each of LSST’s information platforms for new and updated artifacts. DocHub stores metadata as JSON-LD, which is a community-standard, extensible, and self-describing schema. This technote establishes the basic design concept for DocHub, including its architecture and JSON-LD metadata patterns.

Architectural overview

To summarize discussion in the technote, these are the components in DocHub’s architecture:

A metadata schema

DocHub uses JSON-LD since it is extensible, yet self-describing. DocHub builds upon the CodeMeta JSON-LD schema as much as possible. Similar to other information discovery sites, like code.gov, this metadata will be embedded in source repositories whenever possible. The technote also describes a way of templating the metadata embedded in Git repositories, allowing as much information as possible to be extracted from a project’s content. The same metadata format is used in the database.
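To make the templating idea concrete, here is a minimal sketch of how a metadata template stored in a repository might be filled in from values extracted from the project's content. The "{{ ... }}" placeholder syntax, the field selection, and the use of the plain schema.org context are all assumptions for illustration; the actual DocHub schema builds on CodeMeta and is specified in the technote, not here.

```python
import json

# Hypothetical template, as it might be committed to a technote repository.
# The "{{ name }}" placeholder convention is an assumption of this sketch.
TEMPLATE = {
    "@context": "https://schema.org",
    "@type": "Report",
    "identifier": "{{ handle }}",
    "name": "{{ title }}",
    "author": [{"@type": "Person", "name": "{{ first_author }}"}],
}


def render(node, values):
    """Recursively replace "{{ key }}" strings with entries from values."""
    if isinstance(node, dict):
        return {key: render(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [render(child, values) for child in node]
    if isinstance(node, str) and node.startswith("{{") and node.endswith("}}"):
        return values[node[2:-2].strip()]
    return node


# Values an adapter might extract from the repository's content
# (the author name is a placeholder, not real metadata).
extracted = {
    "handle": "SQR-013",
    "title": "LSST DocHub Design",
    "first_author": "Example Author",
}

record = render(TEMPLATE, extracted)
print(json.dumps(record, indent=2))
```

Walking the parsed structure, instead of substituting into raw JSON text, means extracted values never need manual escaping.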

A metadata database

DocHub uses a MongoDB database to store all metadata. MongoDB is a document database that works natively with JSON. The JSON-LD that’s embedded in source repositories is available through MongoDB.
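As a rough sketch of what storing a record could look like, the function below upserts a JSON-LD document keyed on its "@id". It is written against the replace_one(filter, replacement, upsert=True) method that pymongo's Collection provides; the choice of "@id" as the key is my assumption, and the InMemoryCollection class is only a stand-in so the example runs without a database.

```python
def upsert_record(collection, record):
    """Insert the JSON-LD record, or replace the stored copy with the same @id.

    `collection` is expected to behave like a pymongo Collection.
    """
    return collection.replace_one({"@id": record["@id"]}, record, upsert=True)


class InMemoryCollection:
    """Stand-in for a pymongo Collection; implements only replace_one."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["@id"]
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)


# Hypothetical record; the URL is a placeholder.
metadata = InMemoryCollection()
upsert_record(metadata, {"@id": "https://example.org/sqr-013", "name": "Draft"})
upsert_record(
    metadata, {"@id": "https://example.org/sqr-013", "name": "LSST DocHub Design"}
)
print(len(metadata.docs))
```

Because the write is an upsert, an adapter can re-send a record every time its source changes without first checking whether the record already exists.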

A full-text database

While MongoDB is well-suited to querying semi-structured data like JSON-LD, its full-text search capabilities are more limited. Where possible, the content of documents will be stored and made available through Elasticsearch.
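The division of labor between the two databases can be illustrated by the shape of the queries each one would receive. This is only a sketch: the "author.name" and "content" field names are assumptions, not part of the DocHub design.

```python
def metadata_filter(doc_type, author_name):
    """MongoDB filter document: exact matches on structured JSON-LD fields."""
    return {"@type": doc_type, "author.name": author_name}


def fulltext_query(text, size=10):
    """Elasticsearch search body: relevance-ranked match on document text."""
    return {"query": {"match": {"content": text}}, "size": size}


# Structured question: "which reports did this author write?"
print(metadata_filter("Report", "Example Author"))

# Free-text question: "which documents talk about information discovery?"
print(fulltext_query("information discovery"))
```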

Ingest adapters

Each ingest adapter is a microservice built to transform content and metadata for a particular type of artifact into a JSON-LD record and full-text entry stored in the MongoDB and Elasticsearch databases. This adapter architecture helps DocHub scale: indexing a new arbitrary information source involves deploying a new adapter service. Adapters can either be pushed to (say, by a GitHub webhook) or poll a platform for new and updated artifacts. Each adapter handles the platform-specific challenge of transforming either templated JSON-LD stored in a source repository or a platform’s native metadata into standardized DocHub JSON-LD.
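One way to picture the adapter pattern is as a small shared interface that every platform-specific adapter implements. The class and method names below are hypothetical, as is the shape of the forum payload; the sketch only shows how push and poll modes can share the same transformation code.

```python
from abc import ABC, abstractmethod


class IngestAdapter(ABC):
    """Turns one platform's native artifact into JSON-LD plus full text."""

    @abstractmethod
    def to_jsonld(self, native):
        """Return the standardized JSON-LD record for a native artifact."""

    @abstractmethod
    def to_fulltext(self, native):
        """Return the text to index for full-text search."""

    def handle_push(self, native, store):
        """Push mode: called once per artifact, e.g. from a webhook handler."""
        store(self.to_jsonld(native), self.to_fulltext(native))

    def poll(self, fetch_updates, store):
        """Poll mode: ingest everything fetch_updates() reports; return count."""
        count = 0
        for native in fetch_updates():
            self.handle_push(native, store)
            count += 1
        return count


class ForumTopicAdapter(IngestAdapter):
    """Hypothetical adapter for forum topics given as simple dicts."""

    def to_jsonld(self, native):
        return {
            "@context": "https://schema.org",
            "@type": "DiscussionForumPosting",
            "@id": native["url"],
            "headline": native["title"],
        }

    def to_fulltext(self, native):
        return native["body"]
```

Here store stands for whatever callable writes to the metadata and full-text databases, so the adapter itself never needs to know about MongoDB or Elasticsearch.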

An API server

The web API server allows applications to query against DocHub’s metadata and full-text databases. DocHub will likely provide a RESTful API for JSON-LD documents and a GraphQL API to efficiently populate the web front-end.

A web front end

This front end is how people will typically use DocHub. This website will allow users to browse and filter DocHub information artifacts, and also provide a generic search against the full-text and metadata databases. The website will be editorially designed to some extent. For example, the front page will show featured projects, papers, and documents in addition to giving entry points to search and browse against usefully selected categories. Generally the website (and API) will allow anonymous access. DocHub can be designed to facilitate authorization-based access to non-public documentation (private GitHub repositories, for example), though this will depend on a centralized user database that doesn’t exist in the needed form yet.

While there will be one main DocHub front end, the API server allows us to embed smaller front ends into other sites, such as on user guide and technote pages.

Next steps

While SQR-013 doesn’t completely specify DocHub’s design (the exact JSON-LD schema still needs to be finalized), it is sufficient to get us started. The main challenge is accomplishing this work within the available resources. DocHub’s architecture allows us to build it up piecemeal, which I found quite effective when I launched LSST the Docs this spring.

Listing technotes is a good minimum viable product to launch DocHub with. A possible implementation path towards that goal is:

Deploy the MongoDB cluster.
Add JSON-LD templates to select technote repositories and deploy a technote adapter microservice that builds JSON-LD metadata documents as those technote repositories are updated.
Build a temporary static site generator to publish a web page listing technotes to www.lsst.io (or whatever URL is chosen for DocHub) based on metadata at rest in the MongoDB metadata database.
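The last step above could start out as something as small as the following sketch, which renders an HTML list from metadata records. The record fields ("identifier", "name", "url") and the sample data are assumptions; in practice the records would be read from the metadata database.

```python
from html import escape


def render_technote_index(records):
    """Render technote metadata records as an HTML list, sorted by handle."""
    items = []
    for rec in sorted(records, key=lambda r: r["identifier"]):
        items.append(
            '<li><a href="{url}">{handle}: {name}</a></li>'.format(
                url=escape(rec["url"], quote=True),
                handle=escape(rec["identifier"]),
                name=escape(rec["name"]),
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Placeholder records; the URLs and the second entry are made up.
records = [
    {
        "identifier": "SQR-013",
        "name": "LSST DocHub Design",
        "url": "https://example.org/sqr-013",
    },
    {
        "identifier": "EXAMPLE-001",
        "name": "Search & Discovery",
        "url": "https://example.org/example-001",
    },
]

page = render_technote_index(records)
print(page)
```

Since the generator depends only on the records, it can later be swapped for the React front end without touching the adapters or the database.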

With this MVP in place, we can begin to develop DocHub in independent tickets (even in parallel):

Build the API server’s RESTful and GraphQL endpoints.
Replace the static site generator with a React application powered by the API (and begin implementing faceted search patterns).
Build up full-text search.
Build additional ingest adapters for other types of platforms (GitHub repositories in general, presentations in Zenodo, documents in DocuShare, and so on).

Again, see SQR-013 for more DocHub design information.