Documenting Your Org's Files

I’ve been thinking lately about the things desktop computer users get that we’ve decided aren’t worth giving software engineers in their own infrastructure.

I don’t think I’ve ever met anyone who works at a big tech company who is happy with the status quo. The closest we get is less sad. And a lot of the pain points are fixable, but nobody fixes them because they aren’t key business needs. Setting aside, for now, the conversation of what that means (happy engineers are more often than not more productive engineers), I’m going to document some basic quality-of-life improvements as they come to mind and provide some easy and more complicated suggestions for improving them.

So let’s start with document identification.

Double-clicking should just work

When you double-click on a file icon in your OS of choice, it will, in general, just open. You almost never have to think about that interaction.

A file browser window on Linux. Double-clicking any icon shown would open that file in a program.

Even on Linux distros, though I prefer the command line.

I don’t imagine most enterprise software engineers expect to have that experience with anything stored in their blobstores (S3, Cloud Storage, etc.). We content ourselves with that content just being “whatever,” and then you get a new team member and uh-oh. This is a mostly self-inflicted problem, though nothing in the ecosystems of these technologies makes it particularly easy to avoid.

So what can we do about it? Let’s take a look at how OSes work out the relationship between files and apps:

  • Files have a suffix
  • Each suffix is mapped to one or more programs that can handle that suffix

Great! Let’s do that.

The cheap solution: File suffixes

I know it seems obvious, but I’ve seen far too many data.bin files in too many orgs to believe it is: if you choose consistent file suffixes and actually use them, you can shave a lot of discoverability pain off your team. “But the suffix isn’t up to us,” I hear the strawman in my brain saying, “these are all JSON files.” Absolutely! That’s what having.two.suffixes is for. Nothing is stopping you from naming a file november.accounts.json and understanding that accounts.json is a standard file format.

If your system can’t support files with two periods in the name, I would fix that. But if it can’t be fixed, use hyphens or underscores or something. Just be consistent.

Now that we have suffixes, what do we do with them? Put them in a spreadsheet, alphabetized, with a small description and a URL to either internal documentation or the interfaces that read them.

You have just saved your team leads a bunch of time answering the same questions over and over.
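If the spreadsheet ever gets exported into something a script can read, the lookup is a few lines. Here is a minimal sketch, assuming a registry keyed by suffix; the entries, descriptions, and wiki.example.internal URLs are made up for illustration. It tries the longest suffix first, so november.accounts.json resolves to accounts.json before falling back to plain json.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuffixEntry:
    description: str
    docs_url: str


# Hypothetical registry contents; in practice this might be generated
# from the team's spreadsheet rather than written by hand.
REGISTRY = {
    "accounts.json": SuffixEntry(
        "Monthly account snapshot",
        "https://wiki.example.internal/formats/accounts-json",
    ),
    "json": SuffixEntry(
        "Generic JSON, no known schema",
        "https://wiki.example.internal/formats/json",
    ),
}


def lookup(filename: str) -> Optional[SuffixEntry]:
    """Return the entry for the longest registered suffix of filename."""
    parts = filename.lower().split(".")
    # Drop leading parts one at a time, so "accounts.json" is tried
    # before "json". A name with no period never matches.
    for i in range(1, len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in REGISTRY:
            return REGISTRY[candidate]
    return None
```

Calling lookup("november.accounts.json") returns the accounts entry, lookup("notes.json") falls back to the generic JSON entry, and lookup("data.bin") returns None, which is a decent prompt to go name that file properly.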

The more expensive solution: A file identifier

This is likely overkill for most orgs, but if you want to get particularly fancy, you can build a file type identifier service or app that can guess what an arbitrary file is. You can start from a package like file-type that knows common binary formats and then add support for your own.
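To give a feel for what the first layer of such an identifier does, here is a rough sketch in Python. It is a hand-rolled stand-in, not the file-type package's API: it checks a handful of well-known leading byte signatures and then falls back to asking a JSON parser. The sniff function name and the short signature table are my own choices for illustration.

```python
import json

# Well-known leading byte signatures ("magic numbers") for a few formats.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
]


def sniff(data: bytes) -> str:
    """Guess the high-level type of a blob, or return "unknown"."""
    for signature, name in MAGIC:
        if data.startswith(signature):
            return name
    # No binary signature matched, so try to parse it as UTF-8 JSON.
    try:
        json.loads(data.decode("utf-8"))
        return "json"
    except ValueError:
        # Covers both undecodable bytes and malformed JSON.
        return "unknown"
```

A real tool would want many more signatures and would read only the first few kilobytes of large objects, but the shape stays the same: cheap checks first, parsers second.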

If your org uses XML, this is one of those rare times that decision wins hard, because XML is largely self-describing: the root element and namespace usually tell you what you’re looking at.

“Okay, but it’s just going to tell me I have JSON or protobuffer or flatbuffer or whatever.” Correct, and you’ll want to make it extensible so you can add heuristics to guess at what kind of JSON or proto you’re looking at. These can be pretty straightforward (identify the high-level file type, then feed the actual data to e.g. a JSON parser and check for the presence of required fields in the type).
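The required-fields heuristic can be sketched like this. It assumes each in-house JSON format can be recognized by a set of top-level keys; the format names and field names below are hypothetical, and requires_fields and identify_json are names I made up for the example. New formats are added by registering another detector, which is the extensibility hook.

```python
import json
from typing import Callable, Dict, Optional

# Maps a format name to a predicate over the parsed JSON document.
DETECTORS: Dict[str, Callable[[object], bool]] = {}


def requires_fields(name: str, *fields: str) -> None:
    """Register a detector matching JSON objects that contain all fields."""
    def check(doc: object) -> bool:
        return isinstance(doc, dict) and all(f in doc for f in fields)
    DETECTORS[name] = check


# Hypothetical in-house formats and their required top-level fields.
requires_fields("accounts.json", "account_id", "balance", "as_of")
requires_fields("audit-log.json", "actor", "action", "timestamp")


def identify_json(raw: bytes) -> Optional[str]:
    """Return the single matching format name, or None if unsure."""
    try:
        doc = json.loads(raw)
    except ValueError:
        return None  # not JSON at all
    matches = [name for name, check in DETECTORS.items() if check(doc)]
    # Zero matches or several matches: refuse to guess.
    return matches[0] if len(matches) == 1 else None
```

Returning None on ambiguity is a deliberate choice here: a tool that confidently mislabels a file is worse than one that admits it doesn’t know.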

Some things to consider:

  1. This approach requires maintenance, and it’s entirely possible that the cost won’t prove to be worth it to the org (but when doing that calculus, factor in what happens if your lead engineers leave / die and a whole fleet of juniors has to come in and understand your architecture from scratch…).
  2. If it’s built as a service, remember that people will upload anything to it, so it has to have the security protections of something that is allowed to look at literally every piece of data your org puts into a file. Depending on your org’s level of paranoia and internal data-hiding, you may need to shard the program into a per-team-maintained isolated system / CLI tool / service.
  3. The program is a component that everyone in your company touches. Those can impose engineering challenges of their own (synchronizing teams, release cycles, etc.). It is possible that you can get clever here and make the build system capable of picking up file format descriptions (such as protobuffer .proto files) and building detectors from them.

Regardless of the solution you choose, the more specialty knowledge of “what these bytes in long-term storage mean” you can move out of people’s heads and into the system itself, the less time your people will spend hunting for what they need to solve problems.
