This post is about an expansion of the statistics thesaurus that was
started in an earlier post. The entries in The Oxford Dictionary of Statistical Terms, 6th edition, edited by Yadolah Dodge, were the inspiration for updating the thesaurus, and served as a 'to do list' of terms to include.
Compared to the Oxford Dictionary, this thesaurus is much less
technical, and more cross-referenced. Perhaps with time, it could be a workable companion volume.
Because the thesaurus is digital, it makes sense that the cross-references
include hyperlinks. Doing this manually is problematic, however. If I
only include links to existing entries as I write them, then early
entries will either be missing links, or will need to be checked and reread
as more entries are added. If I include links to entries that don't yet
exist in order to save time, then I risk leaving dead links to entries I
neglect to fill in later.
Instead, I've
written this cute little R program that scans the explanations in my
entries for mentions of names of other entries, and creates the
appropriate hyperlinks automatically.
For
example, the entry under "average" includes mentions of "mean" and
"median", both of which are also entries. This program identifies these
cross-references and updates the entry for "average" with HTML code to
make "mean" and "median" hyperlinks to anchors at their respective
entries. Furthermore, when making links, terms that include other terms in their names take priority, so a mention of "negative binomial distribution" will correctly link to that, and not to "binomial distribution" or "distribution".
Before:
Average: Colloquial term for "mean". Can also refer to any measure of centrality, such as the "median", or "trimmed mean", but much less often.
After:
<a name="AVERAGE">Average</a>: Colloquial term for "<a href="#MEAN">mean</a>". Can also refer to any measure of centrality, such as the "<a href="#MEDIAN">median</a>", or "<a href="#TRIMMED MEAN">trimmed mean</a>", but much less often.
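The linking step described above can be sketched in a few lines of R. This is not the program itself, only a minimal illustration of the idea. The function names link_terms and make_entry are hypothetical, and the sketch assumes that entry names are stored in lower case and contain only letters, spaces, and hyphens (so nothing needs escaping in the regular expression). Sorting the names longest-first and replacing in a single pass is one way to get the "longer term wins" behaviour.

```r
# Hypothetical helper: turn mentions of other entry names into hyperlinks.
# Assumes names are lower case and contain only letters, spaces, and hyphens.
link_terms <- function(text, terms, self = NULL) {
  # An entry should not link to itself.
  targets <- setdiff(terms, self)
  if (length(targets) == 0) return(text)
  # Longest names first, so "trimmed mean" is tried before "mean".
  targets <- targets[order(nchar(targets), decreasing = TRUE)]
  pattern <- paste0("(", paste(targets, collapse = "|"), ")")
  # One pass over the text, case-insensitive; text inserted by this
  # function is never rescanned, so links are not nested.
  m <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  hits <- regmatches(text, m)[[1]]
  if (length(hits) == 0) return(text)
  links <- sprintf('<a href="#%s">%s</a>', toupper(hits), hits)
  regmatches(text, m) <- list(links)
  text
}

# Hypothetical helper: add the anchor for the entry's own name.
make_entry <- function(name, body, terms) {
  sprintf('<a name="%s">%s</a>: %s',
          toupper(name), name,
          link_terms(body, terms, self = tolower(name)))
}

terms <- c("average", "mean", "median", "trimmed mean")
body <- paste0('Colloquial term for "mean". Can also refer to any measure ',
               'of centrality, such as the "median", or "trimmed mean", ',
               'but much less often.')
entry <- make_entry("Average", body, terms)
cat(entry, "\n")
```

Run on the "average" entry, this reproduces the "After" line shown above. Because the pattern has no word boundaries, it shares the false-positive behaviour of matching a name inside a longer word.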
It's not a perfect system. It fails
to make a hyperlink if the source entry text and destination entry
disagree on hyphenation or Canadian/American/British spelling. Common
words like "meaning", which contains "mean", will produce false positives (although it's good
practice to avoid words like "meaning" when talking about statistics
anyway). The program is, however,
case-insensitive in its reference discovery. So far it's catching nearly
every reference and linking correctly, but there are only about 80 entries so
far.
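One possible way to reduce false positives like "meaning" is to match whole words only. The sketch below is an assumption about how that could look, not something the current program does. The function name link_whole_words is hypothetical, and entry names are again assumed to be lower case and free of regex metacharacters.

```r
# Hypothetical variant: only link a name when it appears as a whole word,
# so "mean" no longer matches inside "meaning".
link_whole_words <- function(text, terms) {
  # Longest names first, as before.
  terms <- terms[order(nchar(terms), decreasing = TRUE)]
  # \\b marks a word boundary on each side of the name.
  pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
  m <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  hits <- regmatches(text, m)[[1]]
  if (length(hits) == 0) return(text)
  regmatches(text, m) <- list(sprintf('<a href="#%s">%s</a>', toupper(hits), hits))
  text
}

linked <- link_whole_words("The meaning of the mean", c("mean"))
cat(linked, "\n")
```

The trade-off is that plurals such as "means" would stop linking too, so whether this is an improvement depends on how the entries are worded.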
Here are the Google drive files of the original text, the resulting HTML page and the R code.
Here is an older example of the HTML page that is output for these 80 entries.
Disagreements
on my thesaurus entries are encouraged. Please tell me if you think
something should be added, removed, or changed. I would much rather be
wrong temporarily than permanently.