Wednesday, February 10, 2010

Configuration File Trickery

I tend to use a fairly minimal subset of YAML for configuration files. YAML supports my favorite data structures (lists and hashes) and is easy to read and edit (particularly if one ignores its syntax for declaring data types and such). However, in a recent project, I found myself using CSV (comma-separated value) files, instead.

The initial motivation for this choice was my client's preference. He uses spreadsheets (typically, Numbers) on a regular basis and wondered if we could use them to enter configuration information. I couldn't see any reason to object, so we started down that path.

Syntax, Structure, and Semantics

Any data representation format has to deal with three kinds of problems:

  • syntax - distinguishing and representing data elements
  • structure - representing relationships among data elements
  • semantics - determining the "meaning" of data elements

YAML handles syntactic and structural problems very nicely. I don't have to worry about how to parse the files, and the imported data structures can be arbitrarily complex. At worst, my code will need to build occasional supplementary indexes.

In general, I tend to handle semantic problems in my own code. I'm beginning to realize that Semantic Web technology offers some interesting alternatives for large-scale data-integration projects, but that's far beyond the scope of this posting...

Cooking Up Hashes

Although CSV handles syntactic problems, it isn't nearly as flexible as YAML when it comes to representing data structures. Ruby's CSV library turns a CSV file into a row-major list of lists. In contrast, I typically want to use the data as a nested hash.

For example, I might want to define some of the CSV labels as top-level keys, letting the others share the lowest level of the hash. A representative spreadsheet might look like this:

    Name     Value  Description
    age      11     Age, in years
    height   22     Height, in inches
    weight   3      Weight, in pounds

My cook_csv routine creates a multi-level hash, using the raw data and a list of field names as arguments. I can then access the hash, much as if I'd loaded it from YAML:

    rare = cook_csv(raw, [:Name])
    ...
    age_descr = rare[:age][:Description]

In some cases, rows may have identical values in the lowest-level key field. Specifying 0 as the last field name causes the row index to be used as a key instead; this keeps the rows from overwriting each other in the hash.
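
To make this concrete, here is a minimal sketch of a cook_csv-style routine (a simplified stand-in, not the production code; details such as the symbol conversion are assumptions):

    require 'csv'

    # Minimal sketch of a cook_csv-style routine.  raw is a row-major
    # list of lists (as returned by CSV.read), with the field names in
    # the first row.  keys lists the fields to use as nested hash keys;
    # a trailing 0 means "use the row index instead".
    def cook_csv(raw, keys)
      header = raw.first.map { |name| name.to_sym }
      cooked = {}

      raw[1..-1].each_with_index do |row, ndx|
        # Turn the row into a leaf hash, keyed by field name.
        leaf = {}
        header.each_with_index { |name, i| leaf[name] = row[i] }

        # Walk (and create) the upper levels of the hash.
        node = cooked
        keys[0..-2].each { |key| node = (node[leaf[key].to_sym] ||= {}) }

        # Use the last key's value (or the row index, for 0) as the key.
        last = keys.last
        node[last == 0 ? ndx : leaf[last].to_sym] = leaf
      end

      cooked
    end

Given the spreadsheet above, rare[:age][:Value] would be "11"; note that CSV values arrive as strings.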

Generating Variations

One of my spreadsheets summarizes information on a set of product models. Naively, each row would contain the model number and a set of characteristics. However, this would have made the spreadsheet unwieldy and error-prone.

Fortunately, the vendor's model names contain values and/or hints about the characteristics. By borrowing a bit of magic from regular expressions, I was able to reduce the number of rows by an order of magnitude. Let's say that our naive spreadsheet looks as follows:

    Model       Type  H   W   D
    ABC83344    ABC   8   33  44
    ABC113355   ABC   11  33  55
    ABC143366   ABC   14  33  66
    ...
    ABC203366   ABC   20  33  66

There's a pattern here, if only we could take advantage of it. It wouldn't be easy to parse, given that the H field varies in size, but it's quite easy to generate.

Specifically, we can use regular expressions to create the variations, then capture the sub-fields for use elsewhere in the row. My exp_models routine does just this, reducing the snippet above to:

    Model     Type  H   W   D   RegExp
    \1\2\3\4  \1    \2  \3  \4  {ABC} {8,11,14,17,20} {33} {44,55,66}

The data has certainly gotten more complex (some might say, inscrutable), but the reduction in rows is an enormous win. In the actual application, I'm averaging about an 8:1 reduction in rows, which more than compensates (IMHO) for the added complexity.
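
A minimal sketch of an exp_models-style expansion (again, a simplified stand-in, not the production code) might look like this:

    # Minimal sketch of an exp_models-style expansion.  pattern holds
    # {a,b,c} groups; each combination of alternatives yields one row,
    # with \1, \2, ... in the other cells replaced by the alternative
    # chosen for the corresponding group.
    def exp_models(cells, pattern)
      groups = pattern.scan(/\{([^}]*)\}/).map { |g| g.first.split(',') }

      # Cartesian product of the groups: one combination per output row.
      first, *rest = groups
      first.product(*rest).map do |combo|
        cells.map { |cell| cell.gsub(/\\(\d)/) { combo[$1.to_i - 1] } }
      end
    end

    cells   = ['\1\2\3\4', '\1', '\2', '\3', '\4']
    pattern = '{ABC} {8,11,14,17,20} {33} {44,55,66}'
    exp_models(cells, pattern).first(2)
    #=> [["ABC83344", "ABC", "8", "33", "44"],
    #    ["ABC83355", "ABC", "8", "33", "55"]]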

Similar pre-processing techniques can be used with other tools. For example, it's not uncommon to find YAML and embedded Ruby being used to generate test fixtures and other support files.
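
For instance (an invented example), a few lines of ERB can generate repetitive YAML entries:

    require 'erb'
    require 'yaml'

    # Invented example: ERB generates repetitive YAML entries, which
    # are then parsed as ordinary YAML.
    template = [
      '<% [11, 22, 33].each do |n| %>',
      'part_<%= n %>:',
      '  size: <%= n %>',
      '<% end %>',
    ].join("\n")

    fixtures = YAML.load(ERB.new(template).result)
    fixtures['part_11']['size']   #=> 11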

DRYing Things Out

One of the spreadsheets contains about 100 rows of part specifications, with each row containing more than a dozen values. Because there is a lot of repetition (ie, parts that have the same, or related, values), it would be nice to refer to the values symbolically.

So, I store assorted (sub-)expressions in a Macros spreadsheet, reducing the Parts spreadsheet to symbols and literal values. Because the results are going to be used as "dynamic attributes" in Google SketchUp models, my code doesn't actually have to evaluate them, just assemble and output them. A relatively simple bit of macro pre-processing does the trick.

Each row in the Macros spreadsheet looks like this:

    Name   Formula               Description
    %CA    (%RAD * %RAD * 3.14)  area of a circle
    %RAD   (%DIA * 0.5)          radius of a circle
    ...

After loading the relevant CSV files, my code iterates over their content, replacing symbols (eg, %DIA) with corresponding, fully-resolved expressions. The iteration continues as long as unresolved symbols are present and progress is being made.
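
A stripped-down version of the expansion pass might look like this (simplified; the pass limit is a safety net, and the break implements the "progress" test mentioned above):

    # Stripped-down macro expansion (not the production code).  Symbols
    # with no macro definition (eg, %DIA, which comes from the part
    # itself) are left in place for SketchUp to resolve.
    def expand(text, macros, max_passes = 10)
      max_passes.times do
        before = text
        macros.each { |name, formula| text = text.gsub(name, formula) }
        break if text == before   # no further progress; stop
      end
      text
    end

    macros = {
      '%CA'  => '(%RAD * %RAD * 3.14)',
      '%RAD' => '(%DIA * 0.5)',
    }
    expand('%CA', macros)
    #=> "((%DIA * 0.5) * (%DIA * 0.5) * 3.14)"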

This approach is working just fine, with the small problem that the resulting expressions can be rather repetitive:

    ( (%DIA * 0.5) * (%DIA * 0.5) * 3.14)

If this became a real issue (eg, making SketchUp run too slowly), I could bring in some code to simplify the expressions. However, at present, the brute force version is working just fine.

Naive Inheritance

Some of my spreadsheets contain vendor-specific information, others do not, and some contain a mixture. I handle this by allowing both common and vendor-specific instances of each spreadsheet. If there is only one spreadsheet of a given type, the program uses it. If two spreadsheets are present, the program concatenates them (discarding the intervening header row).

Even though this doesn't provide the kind of flexibility I enjoy in Ruby, it allows me to segregate the information in a convenient manner.
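
A minimal sketch of the merging logic (with invented file names) might look like this:

    require 'csv'

    # Minimal sketch of the common/vendor merge (file names invented).
    # With one file, use it as-is; with two, concatenate them and
    # discard the second file's header row.
    def load_sheets(common, vendor)
      rows = File.exist?(common) ? CSV.read(common) : []
      if File.exist?(vendor)
        extra = CSV.read(vendor)
        extra = extra[1..-1] unless rows.empty?   # drop the extra header
        rows += extra
      end
      rows
    end

    raw = load_sheets('parts.csv', 'parts_acme.csv')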

Conclusions

I used a few other data-munging tricks in this project (eg, ignoring empty rows so that they can be used for spacing), but these transformations were certainly the most important ones.

DSLs (domain-specific languages) are all the rage in the Ruby community. I hope that some of the notions above have shown you that they can also be used to good effect in solving data representation problems. Incidentally, Martin Fowler has an upcoming book on DSLs; you might want to keep your eye out for it...
