How to Improve OpenOffice.org Writer Indexing
OpenOffice.org is a great piece of software. And at what better price could you get it? But as with everything, there's always room for improvement.
In particular, the indexing in the Writer component could go a lot further than it does. Until recently, I thought there was only one thing wrong with the indexing in OpenOffice.org. But then I began investigating how to do indexing properly, rather than just in the ad hoc fashion I had always used. I also compared it with the feature lists of some of the commercial professional indexing packages, and I began to see what could be done better.
I realise that many of the things I mention below can already be done by editing the generated index by hand, but such edits disappear when the index is updated, which is not at all satisfactory.
Standard Inflexibility
One of the disadvantages of working with a standard is that it is just that: standard. Any changes can only be made through a long and laborious process. This is relevant because the OpenDocument Text (ODT) file standard does not support the ideas described below, and until it does, the changes that I am about to suggest will be difficult, if not impossible. However, presenting the information about where I think it should go is the first step towards getting results.
Having said that, I don't claim to be any kind of expert in ODT; there's probably a lot of fine-tuning that needs to be done to these ideas. But I think the basic concepts are sound.
A New Look at Index Entries
The way indexes are currently constructed in ODT is that items called "Index Entries" are inserted throughout the document, and then the software, when told to update the index, compiles all these things into an index. However, I think this means that two concepts are being confused. One is the entry in the index. The other is the mark in the document, to which the index entry refers. In the most typical case, this distinction is entirely unimportant. But there are so many other cases where the distinction is crucial.
I also think that the distinctions that are currently being made between the different types of index entries, and between them and bookmarks and references, are not useful, but my reasons for thinking this will only make sense after I've explained what I propose to replace them with.
The New Ideas
Anchors
Strangely enough, I think HTML got it generally right on this one; anchors are the best solution. I'd propose that the anchors have two attributes and two child tags.
The attributes I suggest are "name" and "reference" (much like the "name" and "href" attributes of HTML anchors). An anchor with just a name could replace a bookmark. An anchor with just a reference could replace the reference tag. An anchor with both could be both a bookmark and a reference at the same time. The use of this will become obvious later in the document.
The child tags I'm proposing are the template and data tags. The idea is something like this:
<anchor name="Getting Started">
<data>
<data-item name="Name" value="Getting Started with Earwax"/>
</data>
</anchor>
........
<anchor reference="Getting Started">
<template>More information in the section <data-item name="Name"/> on page
<data-item name="PageNo"/>.</template>
</anchor>
In case it's not immediately obvious, the idea is that an anchor can provide data, or use data provided by the anchors it refers to. In the above example, the first anchor provides the Name data item, and the PageNo data item is calculated. The second anchor has no data of its own (although it could), but has a template that uses the data from the first anchor (which it can do because it refers to it).
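As a rough sketch of how a template might be filled in, assuming the anchors have already been parsed into a simple in-memory structure: the `anchors` dictionary, the `render` function and the page number 12 are all hypothetical and only stand in for whatever Writer would really do.

```python
import re

# Hypothetical in-memory model of the first anchor in the example above.
# "page" stands in for the page number that the layout engine would calculate.
anchors = {
    "Getting Started": {
        "data": {"Name": "Getting Started with Earwax"},
        "page": 12,
    },
}

def render(template: str, target_name: str) -> str:
    """Fill each <data-item name="..."/> placeholder from the referenced anchor."""
    target = anchors[target_name]
    # PageNo is calculated rather than stored, so it is added here.
    values = {**target["data"], "PageNo": str(target["page"])}
    return re.sub(
        r'<data-item name="([^"]+)"/>',
        lambda match: values[match.group(1)],
        template,
    )

template = (
    'More information in the section <data-item name="Name"/> '
    'on page <data-item name="PageNo"/>.'
)
print(render(template, "Getting Started"))
```

The second anchor carries only the template; everything it prints comes from the anchor it refers to.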
I also propose the following attributes for anchors, whose use will become clear in the rest of the document:
- NoPassData
- Parent: can have multiple values
An anchor which points to an anchor which points to an anchor should receive the data of the anchor that is two steps away from it, unless the NoPassData attribute of the middle anchor is set; if it is set, then the middle anchor will pass on its own data, but not the data it receives from the anchors it refers to. Templates are never passed.
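The passing rule could be sketched like this. The `Anchor` class and both function names are hypothetical, each anchor is assumed to refer to at most one other anchor, reference chains are assumed not to loop, and letting an anchor's own values win when names clash is my assumption rather than part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Anchor:
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    reference: Optional[str] = None   # name of the anchor this one points to
    no_pass_data: bool = False        # the proposed NoPassData attribute

def passed_data(anchor: Anchor, anchors: Dict[str, Anchor]) -> Dict[str, str]:
    """Data this anchor hands on to any anchor that refers to it."""
    if anchor.no_pass_data:
        # Pass on its own data only, not what it received.
        return dict(anchor.data)
    return visible_data(anchor, anchors)

def visible_data(anchor: Anchor, anchors: Dict[str, Anchor]) -> Dict[str, str]:
    """Data available to this anchor: what it receives plus its own."""
    received: Dict[str, str] = {}
    if anchor.reference is not None:
        received = passed_data(anchors[anchor.reference], anchors)
    # Assumption: an anchor's own values override received ones.
    return {**received, **anchor.data}

anchors = {
    "a": Anchor("a", {"Name": "Earwax"}),
    "b": Anchor("b", {"Kind": "section"}, reference="a"),
    "c": Anchor("c", reference="b"),
}
print(visible_data(anchors["c"], anchors))   # data from both a and b
anchors["b"].no_pass_data = True
print(visible_data(anchors["c"], anchors))   # data from b only
```

Templates do not appear in the sketch at all, which matches the rule that they are never passed.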
The anchors above could also have <anchor-start/> and <anchor-end/> tags.
Anchors are not necessarily unique within the document; that varies from dataset to dataset.
Anchor Datasets
The Anchor Dataset is a new concept. The idea is that it is an invisible repository of anchors, which can be accessed by a number of different Dataset Accessors, such as indices. An Anchor Dataset would include information about:
- Which anchors in the dataset are parents of others
- Whether to merge keys case-sensitively or not
- Which words to leave out altogether (e.g. merging "College" and "The College")
- Whether, when a key is merged in, to leave a cross-reference or not
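The merging settings in the list above could behave something like the following sketch. The function names, the option names and the page numbers are hypothetical, and keeping the first spelling seen as the label of a merged entry is an assumption.

```python
def merge_key(key, ignore_words=("the",), case_sensitive=False):
    """Normalise a key so that variants such as "The College" merge with "College"."""
    words = [w for w in key.split() if w.lower() not in ignore_words]
    merged = " ".join(words)
    return merged if case_sensitive else merged.casefold()

def merge_entries(entries, leave_cross_reference=True, **options):
    """entries is a list of (key, page) pairs.

    Returns the merged index and a mapping of merged-away keys to the
    entry they now live under (the optional cross-references).
    """
    index = {}        # normalised key -> {"label": ..., "pages": [...]}
    cross_refs = {}   # merged-away spelling -> label it was merged into
    for key, page in entries:
        slot = index.setdefault(merge_key(key, **options),
                                {"label": key, "pages": []})
        slot["pages"].append(page)
        if leave_cross_reference and key != slot["label"]:
            cross_refs[key] = slot["label"]
    return index, cross_refs

entries = [("College", 3), ("The College", 9), ("college", 14)]
index, cross_refs = merge_entries(entries)
print(index)
print(cross_refs)
```

With `case_sensitive=True`, "College" and "college" would stay separate entries while "The College" would still merge into "College".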
There would be two pre-defined Anchor Datasets:
- Default: All the unassigned anchors within the document (i.e. any anchor that doesn't have its Parent attribute set and is not inside an Anchor Dataset). Its target anchors must be unique (see below).
- Headings: All the headings in the document (which are automatically anchors). Its target anchors must be unique (see below).
Inside an Anchor Dataset, the "Parent" attribute on an anchor specifies the name of the parent anchor; this information is used when constructing indices.
Outside an Anchor Dataset, the "Parent" attribute indicates the name of the dataset(s) which refers to this anchor.
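For anchors inside a dataset, the Parent attribute is enough to build the nesting of an index. The sketch below assumes each anchor is a (name, parents) pair and that the parent links do not form a loop; the function names and the sample entries are hypothetical. Because Parent can have multiple values, the same entry can appear under more than one heading.

```python
from collections import defaultdict

def build_tree(anchors):
    """Map each parent name to its children; None marks a top-level entry."""
    children = defaultdict(list)
    for name, parents in anchors:
        for parent in parents or [None]:
            children[parent].append(name)
    return children

def outline(children, parent=None, depth=0):
    """Return the index as indented lines, children sorted under each parent."""
    lines = []
    for name in sorted(children.get(parent, [])):
        lines.append("  " * depth + name)
        lines.extend(outline(children, name, depth + 1))
    return lines

anchors = [
    ("Plays", []),
    ("Wilde, Oscar", []),
    ("The Importance of Being Earnest", ["Plays", "Wilde, Oscar"]),
]
print("\n".join(outline(build_tree(anchors))))
```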
The anchors inside the dataset must be unique. A dataset's "target anchors" are the anchors which are referred to by the dataset, but are not inside the dataset element. Each dataset would have to define whether its target anchors must be unique or not.
An anchor dataset can be accessed either by an Anchor Display (which would generally deal with a number of anchors selected), or by an Anchor (which would generally deal with a single anchor selected from the dataset).
Anchor Displays
An Anchor Display would format the results of an Anchor Dataset. This would replace the Index. In addition to the information currently contained in an Index, it would have a reference to one or more datasets.
An Anchor Dataset and an Anchor Display will both need to be created for the index. This would mean that one Anchor Dataset could feed information to multiple Anchor Displays, and one Anchor Display could pull information from multiple Anchor Datasets. An Anchor Display should be able to filter the data it receives from the dataset.
An Anchor Display would probably need the following fields, in addition to the information already stored in an index:
- The name of the display
- A list of the Anchor Dataset(s)
- Filters for the information selected from the Anchor Datasets (e.g. only show entries from the current chapter)
- Templates for the different types of anchors (i.e. apply this template if these conditions are true, otherwise this one, etc.)
- Sort ordering
  - Letter-by-letter vs. word-by-word alphabetising
  - Numeric vs. string
  - Specified on a per-anchor-parent basis; i.e. everything under one key is sorted one way, and everything under another, a different way
  - Which words to ignore when sorting (in, by, the, etc.)
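To show what the sort-ordering settings would change, here is a small sketch of letter-by-letter versus word-by-word alphabetising with ignored words. The `sort_key` function and the sample entries are hypothetical, and dropping ignored words wherever they occur (rather than only at the start) is an assumption.

```python
import re

def sort_key(heading, mode="word", ignore=("in", "by", "the")):
    """Build a sort key for an index heading.

    mode="word":   compare word by word, so a shorter first word sorts first.
    mode="letter": compare letter by letter, ignoring the spaces between words.
    """
    words = [w for w in re.findall(r"[a-z0-9]+", heading.lower())
             if w not in ignore]
    if mode == "letter":
        return ("".join(words),)
    return tuple(words)

entries = ["Newark", "New York", "The New Forest", "New Haven"]

print(sorted(entries, key=lambda e: sort_key(e, "word")))
# ['The New Forest', 'New Haven', 'New York', 'Newark']

print(sorted(entries, key=lambda e: sort_key(e, "letter")))
# ['Newark', 'The New Forest', 'New Haven', 'New York']
```

A per-anchor-parent setting would simply mean choosing a different `mode` or `ignore` list for the children of each parent.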
How to Replace the Old Ideas with the New
How to do a Table of Contents with the New Ideas
This is quite simple; insert an Anchor Display that draws on the Headings dataset, and everything will happen. If the table of contents is just for one part of the document, then filters can be applied to select just that part.
How to do Indices with the New Ideas
Index Entries should all be replaced with anchors which have a Name.
The Index will need to be replaced with an Anchor Dataset and an Anchor Display. The advantage here is that multiple displays can draw on one dataset, and one display can draw on multiple datasets.
A great example would be in a document with a bibliography and an alphabetical index. The anchor display for the alphabetical index could also draw data from the bibliography dataset, and have an index entry "Books", with the subentries saying where in the document each book is referred to.
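A minimal sketch of that example, assuming the bibliography dataset and the citing anchors are available as plain Python structures. All the names and the page number for the ordinary index entry are hypothetical; the book title and its pages are borrowed from the feature list later in this article.

```python
# Hypothetical bibliography dataset, keyed by anchor name.
bibliography = {"earnest": {"Title": "The Importance of Being Earnest"}}

# Hypothetical anchors in the text that refer to bibliography entries: (reference, page).
citations = [("earnest", 23), ("earnest", 45), ("earnest", 73)]

# Ordinary index anchors from the index dataset: (name, page).
index_entries = [("Denfield, Azmon", 5)]

def alphabetical_index(index_entries, bibliography, citations):
    """Combine two datasets into one display, adding a "Books" entry."""
    index = {}
    for name, page in index_entries:
        index.setdefault(name, []).append(page)
    books = {}
    for reference, page in citations:
        title = bibliography[reference]["Title"]
        books.setdefault(title, []).append(page)
    # Subentries of "Books" list where each book is referred to.
    index["Books"] = books
    return index

print(alphabetical_index(index_entries, bibliography, citations))
```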
The Anchor Dataset would contain an anchor for each item to appear in the index.
Examples of the use of anchors as index entry replacements:
- In the case of a normal index entry, this could be simply the name of the reference, and maybe a parent under which the entry is to appear.
- In the case of an index cross-reference, this could also contain the target of the cross-reference (for example, "Denfield: see Azmon Denfield").
- In the case of a bibliography, the data could be a large number of fields. The bibliography data would generally be stored in the bibliography dataset, and then selected by anchors which would refer to it (which is the reverse of how normal indices work), as well as being selected by the bibliography display.
New Features
The following new features for OpenOffice.org (not ODT, but OpenOffice.org) would then be possible. Not all of them require the reorganisation suggested above, but they would all benefit from it.
- Better index editing facilities
  - A tabularised list of anchors selected from an Anchor Dataset
    - Designed for editing the Anchor Dataset
    - Displayed in a separate window
    - Filters can be applied from an Anchor Display, or on an ad hoc basis; possible filters include whether the anchors are in a certain chapter, or contain certain text
    - Sortable, but the sorting does not affect the Anchor Dataset
    - Clicking on an entry would go to the point in the document that it refers to
  - A tree-like display of anchors selected from an Anchor Dataset. This could be implemented as part of the current "Navigator" feature; select a dataset, and it would display the information.
    - Designed for editing the Anchor Dataset
    - Dragging and dropping is used for reparenting/duplicating entries
    - Doesn't need to display all fields
    - Double-clicking on an entry would bring it up in the tabular window mentioned above
  - The ability to find any bookmark or reference instantly, or tell you that it doesn't exist
- Better index data management (sorting and filtering)
  - Multiple-dataset indices; the "Books" entry in the alphabetical index could list every time that a book is referred to in the document (e.g. "The Importance of Being Earnest", 23, 45, 73)
  - Part-indexes created from the same Anchor Dataset as the main index
  - Duplicate index entries under different names
  - More customisable sort orders
  - More customisable filters
- Better final output
  - Indexes could have internal cross-references; it would be possible to have an index entry that says something like "Denfield: see Azmon Denfield". This would require the existence of a Bookmark with no References.
  - More customisable templates
  - A library of templates as requested by different publishers
  - User-customisable, storable locator formats, from simple page references to complex volume or database references (these should not be just for indices, but for all cross-references)
Disadvantages
While the information above doesn't specify how to replicate every single feature of indices, references, and the like, the ideas above can, I believe, be extended to cover most of the same functionality.
The only thing I know of that can't be replicated under the above scheme is validation of bibliography entries: because they could contain arbitrary key/value pairs, their data would no longer be checkable by the XML schema files. If developers wanted to put in "Writer" instead of "Author", there'd be nothing to stop them except the spec, which would specify the bibliography fields which should be commonly supported.
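An application could still warn about unexpected fields itself, even though the schema no longer would. The sketch below is only an illustration: apart from "Author", the field names in `RECOMMENDED_FIELDS` are hypothetical stand-ins for whatever list the spec would actually give.

```python
# Hypothetical list of commonly supported bibliography fields.
RECOMMENDED_FIELDS = {"Author", "Title", "Year"}

def unknown_fields(entry):
    """Return the field names of a bibliography entry that are not in the recommended list."""
    return sorted(set(entry) - RECOMMENDED_FIELDS)

entry = {"Writer": "Oscar Wilde", "Title": "The Importance of Being Earnest"}
print(unknown_fields(entry))   # ['Writer']
```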
Conclusion
In my opinion, the advantages of the new scheme greatly outweigh the sole disadvantage that I discovered. That's why I recommend that OASIS go about incorporating it into the ODT format.
- wayland's blog