Create a wordlist generator

This is a little guide for making a wordlist generator. It will give you some programming practice as well as some understanding in combinatorics. I will use Java, but the way should be understandable for coders who don't know Java. But I take as a given that you understand numeral systems (at least the binary one).

The wordlist generator shall be fast and generate all possible words for a given alphabet and a wordlength. So we can begin with defining them. We keep it simple, so we can easily prove the results later. We generate words with a length of 3 and the alphabet will contain only 0 and 1. It is important to prepare it in a way that we can change these later without any problems.

The sceleton of our generator with the first local variables defined looks like this:

Code: Java

public class WordListGen {

public static void main(String[] args) {
generate();
}

private static void generate() {
int wordlength = 3;
char[] alphabet = { '1', '0' };
}
}


Before we really start with programming let's make sure you understand what the result has to be in the end.
A wordlength of three means we have three positions for a word to fill with characters: _ _ _
The alphabet says which characters can fill the positions. Doing this manually we have the following possibilities to arrange the characters:

000
001
010
011
100
101
110
111

Those are eight possible words. As I said you should have an understanding of numeral systems. If so, you can easily see that the results are the numbers from 0 to 7 (decimal) in the binary system. Getting all results for this particular generation just needs us to count from 0 to 7 and translate this to binary.

But we want to make the generator flexible so the alphabet can be easily changed. So we need a solution that works for everything.
How do we get the number of results?
Now we dive into combinatorics. We can compare this situation to an urn that contains the characters '1' and '0'. We have three turns and every single turn we take a character, write it on a paper and put it back to the urn. The order of the results is important, because "001" is another word than "100". That means our case is a variation with repetition (<- because we put the characters back every turn). We can get the number of results by using the formula for variation with repetition: N^k

k is the number of positions or the wordlength.
N is the number of characters in the alphabet.

Applied to our current situation we have characters^wordlength = 2^3 = 8 results and that is correct.
Let's compute this number in our program. For Java there is Math.pow to do exponentiation.

Code: Java

final long MAX_WORDS = (long) Math.pow(alphabet.length,
wordlength);


What we do next is just counting from 0 to (MAX_WORDS - 1) which is 0 to 7. This produces our words as decimal numbers:

Code: Java

for (long i = 0; i < MAX_WORDS; i++) {
}


But we don't want the decimals. We want the numbers in our numeral system. The radix of the numeral system is the number of characters, our case 2 which is the binary system. We compute the radix like that:

Code: Java

final int RADIX = alphabet.length;


Most languages provide out-of-the-box functions to translate numbers to a given numeral system. However, we only need integer representations and no letters. I.e. the hexadecimal system (radix 16) uses 0-9 and the letters A-F. We use 0-15 instead for this example. This way we make sure that we can use every radix and thus large alphabets that may contain more than 35 characters. Also making our own method for that is much more efficient.

Code: Java

private static int[] convertToRadix(int radix, long number, int wordlength) {
int[] result = new int[wordlength];
for (int i = wordlength - 1; i >= 0; i--) {
if (number > 0) {
int rest = (int) (number % radix);
number /= radix;
result[i] = rest;
} else {
result[i] = 0;
}

}
return result;
}


To explain how this conversion method works I use an example: number = 5, radix = 2, wordlength = 3
First we create our array of the length 3. So we provide three places to put numbers in: _ _ _
We run through this array backwards. Our number is not 0, so we divide the number by the radix. We put the rest of this operation (computed via the modul operator %) into the array:
number / radix = 5 / 2 = 2 rest 1
Our array: _ _ 1
The new number is the result of the division = 2

Now we repeat that:
number / radix = 2 / 2 = 1 rest 0
Our array: _ 0 1
The new number is the result of the division = 1

Last turn:
number / radix = 1 / 2 = 0 rest 1
Our array: 1 0 1

And this is the correct result for our decimal to binary conversion.

The whole code by now:

Code: Java

private static void generate() {
int wordlength = 3;
char[] alphabet = { '0', '1' };
final long MAX_WORDS = (long) Math.pow(alphabet.length, wordlength);
final int RADIX = alphabet.length;

for (long i = 0; i < MAX_WORDS; i++) {
int[] indices = convertToRadix(RADIX, i, wordlength);
for(int index : indices){
System.out.print(index);
}
System.out.println();
}
}


Our output looks promising:
000
001
010
011
100
101
110
111

Now let's alter the alphabet. We choose the characters a and b:

Code: Java

char[] alphabet = { 'a', 'b' };


If you have understood the code you will know that the output remains the same (the binaries above). The characters of the alphabet are never used. What we print out are the indices of the characters. We can easily change that by getting the characters of the alphabet and saving them into a char-array named word:

Code: Java

char[] word = new char[wordlength];
for (int k = 0; k < wordlength; k++) {
word[k] = alphabet[indices[k]];
}
System.out.println(word);


This is the whole program:

Code: Java

public class WordListGen {

public static void main(String[] args) {
generate();
}

private static void generate() {
int wordlength = 3;
char[] alphabet = { 'a', 'b' };
final long MAX_WORDS = (long) Math.pow(alphabet.length, wordlength);
final int RADIX = alphabet.length;

for (long i = 0; i < MAX_WORDS; i++) {
int[] indices = convertToRadix(RADIX, i, wordlength);
char[] word = new char[wordlength];
for (int k = 0; k < wordlength; k++) {
word[k] = alphabet[indices[k]];
}
System.out.println(word);
}
}

private static int[] convertToRadix(int radix, long number, int wordlength) {
int[] indices = new int[wordlength];
for (int i = wordlength - 1; i >= 0; i--) {
if (number > 0) {
int rest = (int) (number % radix);
number /= radix;
indices[i] = rest;
} else {
indices[i] = 0;
}

}
return indices;
}
}


And this our output:
aaa
aab
aba
abb
baa
bab
bba
bbb

That is it. You may test this further with other wordlengths and alphabets. Whatever you do: Happy coding

No comments:

Post a Comment