I’ve recently registered for access to the NHS’s dictionary of medicines and devices (dm+d). This was primarily to see what format the data was stored in, and then to see if there was a way of utilising it in a cool webapp.
I downloaded the current release (it’s updated weekly) and unpacked the 5MB archive to reveal some XML and related files. Some of the files are huge (up to 32MB each and ~70MB in total) and there was no way a traditional desktop program was going to manage. I tried a few, in fact, and they all devastated my 2GB of RAM and were generally unusable.
Time for a command-line solution… Vim, the open source text editor. It’s extremely powerful and customisable, but using it takes a little getting used to. Vim was able to open and navigate these huge files with only a slight delay. The next problem for me was being able to read them.
In theory it shouldn’t matter what indenting there is in an XML file, as the indentation doesn’t carry any data, but I find it’s a lot easier to read the files if they’re ‘cleanly’ indented. I began wondering how I was going to solve the problem and thought of a few ideas… a script (Perl, PHP, shell, other…) but none of those came to fruition. After some searching I came across libxml, which includes the xmllint command-line tool.
You can download and compile it from source if you wish, but I decided to download a pre-built version from explain.com, which was pretty good. Another page I came across on entropy.ch explained how to use it within Vim to indent my files super quick.
Here’s what you do…
- To format the whole file, type this sequence:
- :%!xmllint --format -
- Or mark the area visually and then type:
- !xmllint --format -
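If you want to see what that filter does before pointing it at a 32MB file, you can try the same command in a terminal. This is only a rough sketch: the one-line XML sample and the variable names are made up for illustration (they are not real dm+d data), and it assumes xmllint is already installed and on your PATH.

```shell
# A tiny, made-up one-line XML sample (not real dm+d data).
sample='<products><product><name>Example</name></product></products>'

# "-" tells xmllint to read from standard input, which is how Vim's
# "!" filter hands it the buffer; --format re-indents the document.
formatted=$(printf '%s' "$sample" | xmllint --format -)

# Show the re-indented XML.
printf '%s\n' "$formatted"
```

The Vim commands above do the same thing, except that Vim supplies the buffer (or the selected lines) as the input and replaces them with whatever xmllint prints.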