Perls of Wisdom

Adrian Ward, Sidestream (adrian@sidestream.org)

(a better coded version of a similar idea was done by Randal L. Schwartz for Linux Magazine, Sep 1999)

Introduction

This short essay discusses a simple approach to encoding the construction of English sentences, and using a random generative approach to produce any number of varying sentences. The final product will be a Perl script that produces the discussed output.

Mimic, not simulate

As with many generative systems, it is easier to produce code that mimics the behaviour of some other process. This is a much easier way to generate creativity than trying to codify some abstract, intangible process that goes on in one's head. This is where the study of artificial intelligence moves away from generative systems, but we are left with a simpler, more logical approach to generative creativity.

In this respect, we shall be seeking to mimic existing sentence structures rather than trying to build our own. The English language has a complex formal set of rules which dictate the use of certain types of words in certain places. We shall be avoiding the hassle of worrying about this by looking at existing sentences.

Our Basic Process

If we take any starting sentence, categorize each word into a type, and store references to those types (rather than the actual words), then we can start to build groups of words that are of the same type. When reconstructing our sentences, the only thing we need to do is choose any word from each requested group. If a group contains more than one word, we can employ a random function to choose which word to use.

The primary goal of our system is to be able to have a series of pools of words, which it can pick related words out of and then build a new, similar sentence out of those chosen words.
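
As a minimal sketch of this idea, assuming the pools live in an ordinary in-memory hash (the names %pools, type_of and template_for, and the simplified group names, are illustrative and not part of the final scripts), a sentence can be reduced to its sequence of word types like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory pools: group name => list of words in that group.
my %pools = (
    'pronoun'   => [ 'i', 'she' ],
    'verb'      => [ 'am', 'was' ],
    'adjective' => [ 'happy', 'sad' ],
);

# Return the name of the group containing the given word, or undef if unknown.
sub type_of {
    my ($word) = @_;
    for my $type (sort keys %pools) {
        return $type if grep { $_ eq lc $word } @{ $pools{$type} };
    }
    return;
}

# Replace every word of a sentence with a reference to its group.
sub template_for {
    my ($sentence) = @_;
    return join ' ',
        map { my $t = type_of($_); defined $t ? "[$t]" : "[?]" }
        split ' ', $sentence;
}

print template_for('She was sad'), "\n";   # [pronoun] [verb] [adjective]
```

Words that are not in any pool come back as [?] here; in a real system that is the point at which somebody has to be asked which group the word belongs to.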

In order to achieve this, we could sit down and try to work out by hand all the different types of words and build our database manually. But by trying to automate the learning process too, we can make building this database a much easier task - one that doesn't necessarily require a degree in English to complete.

Training

Let us take a sample sentence to begin with:

I am happy

We can split this sentence into its three component words, and then classify each of those words.

I      Subjective Pronoun
am     Verb - 'to be'
happy  Postpositive Adjective

We could take another, similar sentence, and do the same thing.

She was sad

She    Subjective Pronoun
was    Verb - 'to be'
sad    Postpositive Adjective

Obviously these examples have been chosen so that their word types line up, but we can now build three starting groups:

Subjective Pronouns: I/She
'To be' Verbs: am/was
Postpositive Adjectives: happy/sad

Now if we assume that our sentence structure will be

[Subjective Pronoun] ['To be' Verb] [Postpositive Adjective]

We can build a random combination of the two original sentences!

I was sad
I am sad
She am happy

It goes wrong here slightly because we weren't precise enough in declaring our 'to be' verbs - the 'am' is present tense and the 'was' is past tense. I also find it helps to be a little less precise about word definitions at this point. If, while categorizing your words, you'd rather group 'was' with 'will be' (which is still a clash of tenses, but is more suitable), you might find the permutations work a little better. This is a subjective matter. I chose to go down the more flexible route, because I felt that the results would be more unpredictable and surprising when they failed. You will therefore find that my examples don't follow the traditional rules of English word categorization. I've also taken the liberty of choosing group names that are easier to remember.
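
To see exactly which sentences the three starting groups allow, here is a small sketch that lists every combination instead of picking one at random. The sub name all_sentences and the array @structure are illustrative; only the words and the group order come from the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three starting groups from the text, in sentence order.
my @structure = (
    [ 'I', 'She' ],       # Subjective Pronouns
    [ 'am', 'was' ],      # 'To be' Verbs
    [ 'happy', 'sad' ],   # Postpositive Adjectives
);

# Build every sentence that the given sequence of groups can produce.
sub all_sentences {
    my @groups = @_;
    my @sentences = ('');
    for my $group (@groups) {
        # Extend each partial sentence with every word of the next group.
        @sentences = map {
            my $prefix = $_;
            map { $prefix eq '' ? $_ : "$prefix $_" } @$group;
        } @sentences;
    }
    return @sentences;
}

my @all = all_sentences(@structure);
print scalar(@all), " possible sentences\n";
print "$_\n" for @all;
```

Two words in each of three groups give eight sentences, and the list includes the ungrammatical "She am happy" alongside the acceptable ones. Regrouping the verbs changes which of these can appear.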

The Tutor

The following Perl code takes a sentence typed by the user, breaks it down into words and, for each word it doesn't already recognise, asks which word group it belongs in. I've used a hash array to store all the words, so that with one variable reference I can recall all the words in a particular group.

#!/usr/bin/perl

print "Please type your sentence: ";
$inSentence=<>;
chomp($inSentence);

dbmopen(%WORDS,"AutoPhraseWordTypes",0666) or die "Cannot open AutoPhraseWordTypes: $!";

@inSentenceWords=split(/ /,$inSentence);

for $inSentenceThisWord (@inSentenceWords) {
 undef $wordType;
 for $WORDSthisType (keys %WORDS) {
  # \Q...\E stops punctuation in the typed word being treated as regex syntax
  if (("/".$WORDS{$WORDSthisType}."/") =~ /\/\Q$inSentenceThisWord\E\//i) {
   $wordType=$WORDSthisType;
  }
 }
 if (!defined $wordType) {
  print "\nUnknown word \"$inSentenceThisWord\". What type is it?\n";
  undef @TypeList;
  for (sort keys %WORDS) {
   print "  $_ (such as $WORDS{$_})\n";
  }
  print "Please provide the word's type name: ";
  $inType=<>;
  chomp($inType);
  $WORDS{$inType}.="/$inSentenceThisWord";
  $WORDS{$inType}=~s/^\///;
  $templateSentence.="[$inType] ";
 } else {
  $templateSentence.="[$wordType] ";
 }
}

$templateSentence=~s/ $//;
print "Sentence structure: $templateSentence\n";

dbmclose(%WORDS);

This script should produce a database file of a hash array which contains all the words the system has ever seen. What sentences you type into it, and how you choose to categorize them is down to you.

A typical entry in the hash array will look something like this:

$WORDS{'to be verbs'}="am/was";

Later on, when we come to generate our own sentences, wherever we see a reference to [to be verbs] in one of our sentence structures we will simply split() the contents of that hash value on the forward slash (/) and choose one of the resulting words at random.
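
A minimal sketch of that split-and-choose step, using an in-memory hash in place of the database file (the sub names words_in and pick_word are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a slash-separated group (as stored in the hash) into its words.
sub words_in {
    my ($group) = @_;
    return split /\//, $group;
}

# Choose one word at random from a slash-separated group.
sub pick_word {
    my ($group) = @_;
    my @choices = words_in($group);
    # rand(@choices) yields 0 <= n < number of words; the index is truncated.
    return $choices[ rand @choices ];
}

my %WORDS = ( 'to be verbs' => 'am/was' );   # sample entry from the text
print pick_word( $WORDS{'to be verbs'} ), "\n";   # prints either "am" or "was"
```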

One of the last things to do is to remember all the sentence structures we have typed in. You'll notice that after the script has queried any words it doesn't understand, it prints the sentence structure out. The final version of my tutor script also stores these in a hash array, for no reason other than it's very easy to tie a hash array to a database file. This is wasteful, because I'm only using the keys in the hash array, and not the actual values, but it's quick and easy, avoids duplicate entries (you can't have two entries in a hash with the same key), and besides, the three principal virtues of a programmer are Laziness, Impatience, and Hubris [1].

#!/usr/bin/perl

print "Please type your sentence: ";
$inSentence=<>;
chomp($inSentence);

dbmopen(%WORDS,"AutoPhraseWordTypes",0666) or die "Cannot open AutoPhraseWordTypes: $!";

@inSentenceWords=split(/ /,$inSentence);

for $inSentenceThisWord (@inSentenceWords) {
 undef $wordType;
 for $WORDSthisType (keys %WORDS) {
  # \Q...\E stops punctuation in the typed word being treated as regex syntax
  if (("/".$WORDS{$WORDSthisType}."/") =~ /\/\Q$inSentenceThisWord\E\//i) {
   $wordType=$WORDSthisType;
  }
 }
 if (!defined $wordType) {
  print "\nUnknown word \"$inSentenceThisWord\". What type is it?\n";
  undef @TypeList;
  for (sort keys %WORDS) {
   push @TypeList,$_;
   print "  $#TypeList:$_ (such as $WORDS{$_})\n";
  }
  print "Please provide the type number, or provide a new type name: ";
  $inType=<>;
  chomp($inType);
  # A purely numeric answer that is in range selects an existing group
  if ($inType =~ /^\d+$/ && $inType <= $#TypeList) {
   $inType=$TypeList[$inType];
  }
  $WORDS{$inType}.="/$inSentenceThisWord";
  $WORDS{$inType}=~s/^\///;
  $templateSentence.="[$inType] ";
 } else {
  $templateSentence.="[$wordType] ";
 }
}

dbmclose(%WORDS);

$templateSentence=~s/ $//;
print "Sentence structure: $templateSentence\n";

dbmopen(%SENTENCES,"AutoPhraseSentenceStructures",0666) or die "Cannot open AutoPhraseSentenceStructures: $!";
$SENTENCES{$templateSentence}=".";
dbmclose(%SENTENCES);
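
The way hash keys avoid duplicate sentence structures can be seen in isolation with a plain in-memory hash standing in for the database-tied %SENTENCES. This is only a sketch: the sub name remember and the sample structures are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %SENTENCES;   # plain in-memory hash standing in for the DBM-tied one

# Record a structure; storing the same key twice simply overwrites it.
# Returns the number of distinct structures known so far.
sub remember {
    my ($structure) = @_;
    $SENTENCES{$structure} = '.';   # the value is a throwaway placeholder
    return scalar keys %SENTENCES;
}

remember('[pronoun] [to be verbs] [adjective]');
remember('[pronoun] [to be verbs] [adjective]');   # duplicate, no new entry
remember('[pronoun] [to be verbs] [adverb] [adjective]');

print scalar(keys %SENTENCES), " distinct structures\n";   # 2 distinct structures
```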

I also added a handy number selection system, because I found it was too easy to mistype the word group names. Also, bear in mind that group names are case sensitive (only the word lookup ignores case, because of the /i flag on its match), so it's best for now to work entirely in lower case.

So now we've got our two data files, AutoPhraseWordTypes and AutoPhraseSentenceStructures, both databases of hash arrays, which contain all the knowledge we need to get going generating our own sentences. If you don't want to spend ages typing in and categorizing words by hand, then use the following text files and the txt2hash.pl utility to replicate my data files. There are a few grammatical problems with mine because I wasn't careful enough while grouping words together, but they suffice.

Download: AutoPhraseWT.txt, AutoPhraseSS.txt and txt2hash.pl

The generator

Having prepared the groundwork and generated our databases accurately enough, choosing a sentence structure at random and then inserting randomly chosen words from the correct groups is a simple task.

#!/usr/bin/perl

dbmopen(%SENTENCES,"AutoPhraseSentenceStructures",0666) or die "Cannot open AutoPhraseSentenceStructures: $!";
for (keys %SENTENCES) {
 push @sentences,$_;
}
dbmclose(%SENTENCES);

$templateSentence=$sentences[rand($#sentences+1)].".";

dbmopen(%WORDS,"AutoPhraseWordTypes",0666) or die "Cannot open AutoPhraseWordTypes: $!";
for (keys %WORDS) {
 @choices=split(/\//,$WORDS{$_});
 # \Q...\E matches the group name literally, even if it contains punctuation
 while ($templateSentence=~/\[\Q$_\E\]/) {
  $templateSentence=~s/\[\Q$_\E\]/$choices[rand($#choices+1)]/;
 }
}
dbmclose(%WORDS);

$templateSentence=~s/^(.)/\u$1/;

print "$templateSentence\n";

This script works by first choosing a sentence structure from the ones stored in the keys of the AutoPhraseSentenceStructures database. Then, for each group in the AutoPhraseWordTypes database, it looks for occurrences of references to that group in the sentence template, and replaces each occurrence with a word chosen randomly from the appropriate group.

There are probably better and more efficient ways to do this, but this works nicely on a small scale, and it's still fairly easy to understand.
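
One possible alternative, offered only as a sketch and not as the method used by the script above, is to scan the template once and look up each [group] reference as it is found, rather than looping over every group in the database. The sub name fill_template and the sample data are illustrative, and an in-memory hash stands in for the database file. It needs Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill a template in one pass: each [group] reference is replaced by a word
# drawn at random from that group; unknown groups are left untouched.
sub fill_template {
    my ($template, $words) = @_;
    $template =~ s{\[([^\]]+)\]}{
        my $name    = $1;
        my @choices = split /\//, ( $words->{$name} // '' );
        @choices ? $choices[ rand @choices ] : "[$name]";
    }ge;
    return ucfirst $template;   # capitalise the first letter of the sentence
}

# Sample data in the same slash-separated format as the word database.
my %words = (
    'pronoun'     => 'i/she',
    'to be verbs' => 'am/was',
    'adjective'   => 'happy/sad',
);

print fill_template( '[pronoun] [to be verbs] [adjective].', \%words ), "\n";
```

Unlike the loop in the original script, this version never rescans words it has already inserted, so a stored word that happens to look like a [group] reference is left as it is.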

See it in action

A CGI version of this script is in action at http://generative.net/AutoPhrase.cgi. The same technique is used to build the <TITLE> tag on the generative.net homepage.

Notes and references

[1] man perl