Just the Words, Please
In various kinds of textual analysis scripts, you sometimes need just the words.
I know two ways to do this. The deroff command was designed to strip out troff constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u if you want only one of each.
deroff has one major failing, though. It considers a word to be a string of two or more characters beginning with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "A."
A substitute is tr, which can perform various kinds of character-by-character conversions.
To produce a list of all the individual words in a file, type:
% tr -cs A-Za-z '\012' < file
The -c option "complements" the first string passed to tr; -s squeezes out repeated characters. This has the effect of saying: "Take any non-alphabetic characters you find (one or more) and convert them to newlines (\012)."
(Wouldn't it be nice if tr just recognized standard UNIX regular expression syntax? Then, instead of -c A-Za-z, you'd say '[^A-Za-z]'. It's not any less obscure, but at least it's used by other programs, so there's one less thing to learn.)
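Here is a minimal sketch of that tr command at work, run on a made-up sentence rather than a real file. The sample text and the variable names are mine, purely for illustration. It also pipes the result through sort -u to get one of each word; I set LC_ALL=C only so that the sort order is predictable from one system to the next.

```shell
# Hypothetical sample text; in practice you would read from a file.
sample='A cat saw a dog. The dog saw a cat!'

# -c complements the letters, so tr matches every non-letter;
# -s squeezes each run of resulting newlines (\012) down to one.
words=$(printf '%s\n' "$sample" | tr -cs A-Za-z '\012')

# One of each word. LC_ALL=C is an assumption made here to get
# plain byte-order sorting (uppercase before lowercase).
unique=$(printf '%s\n' "$words" | LC_ALL=C sort -u)

printf '%s\n' "$words"
printf '%s\n' "$unique"
```

Notice that the one-letter words "A" and "a" survive, which is the case deroff -w leaves out.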
The System V version of tr has slightly different syntax. You'd get the same effect with:
% tr -cs '[A-Z][a-z]' '[\012*]' < file
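If your tr follows the POSIX specification, the character-class notation is another way to ask for the same thing without relying on either range syntax. Treat this as a sketch and not as part of the recipe above: the [:alpha:] class and the [\n*] repeat form are taken from the POSIX description of tr, and the input line and the variable name are invented for the example.

```shell
# Hypothetical input line, standing in for a real file.
# [:alpha:] names the letters; [\n*] repeats newline as often as needed.
portable=$(printf 'one, two;three\n' | tr -cs '[:alpha:]' '[\n*]')

printf '%s\n' "$portable"
```

Whether [:alpha:] matches exactly the same characters as A-Za-z depends on your locale, so check it against your own text before trusting it.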
- TOR