com.evanmclean.evlib.text
Class FuzzyCompare

java.lang.Object
  extended by com.evanmclean.evlib.text.FuzzyCompare

public class FuzzyCompare
extends Object

Performs a fuzzy comparison between two strings to try and find similar strings.

Note: This class requires a recent version of the Apache Commons Lang library.

The algorithm is roughly:

  1. Break each string up into a lexicon of unique words.
  2. Eliminate any words that are shorter than the minimum length or are on the ignored list.
  3. Eliminate any words that do not have at least one character in them.
  4. For each of the remaining words in the first list, find the closest match in the second list based on the Levenshtein distance.
  5. Do the same for the remaining words in the second list.
  6. Return the average Levenshtein distance.

All comparisons are case insensitive (mainly by converting the lexicon to all lower case).
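
The steps above can be sketched in a few dozen lines. This is a minimal, self-contained illustration and not the source of FuzzyCompare: the class name FuzzySketch and all of its method names are hypothetical, the distance is computed by hand rather than through Apache Commons Lang, step 3 is read as "contains at least one letter", the average is taken over the words of both lexicons, and the handling of empty lexicons is a guess, since none of these details are documented here.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of the documented steps; not the FuzzyCompare source. */
final class FuzzySketch {

    /** Steps 1-3: lower-case, split into unique words, drop short, ignored, or letterless words. */
    static Set<String> lexicon(String text, Set<String> ignored, int minLength) {
        Set<String> words = new LinkedHashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            // Assumption: "at least one character" means at least one letter.
            boolean hasLetter = w.chars().anyMatch(Character::isLetter);
            if (w.length() >= minLength && hasLetter && !ignored.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    /** Two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Steps 4 and 5: for each word in `from`, add the distance to its closest word in `to`. */
    static int sumOfClosest(Set<String> from, Set<String> to) {
        int total = 0;
        for (String word : from) {
            int best = Integer.MAX_VALUE;
            for (String candidate : to) {
                best = Math.min(best, levenshtein(word, candidate));
            }
            total += best;
        }
        return total;
    }

    /** Step 6: average of the closest-match distances over both lexicons. */
    static double difference(String lhs, String rhs, Set<String> ignored, int minLength) {
        Set<String> left = lexicon(lhs, ignored, minLength);
        Set<String> right = lexicon(rhs, ignored, minLength);
        if (left.isEmpty() || right.isEmpty()) {
            // Assumption: the documentation does not say what happens with empty lexicons.
            return left.isEmpty() && right.isEmpty() ? 0.0 : Double.POSITIVE_INFINITY;
        }
        int total = sumOfClosest(left, right) + sumOfClosest(right, left);
        return (double) total / (left.size() + right.size());
    }
}
```

Under these assumptions, comparing "quick fox" with "quack fox" (no ignored words, minimum length 3) yields 0.5: "quick" and "quack" are one edit apart in each direction, "fox" matches exactly, and the total of 2 is divided by the 4 words considered.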

Author:
Evan McLean, McLean Computer Services (see the overview for copyright and licensing).

Constructor Summary
FuzzyCompare()
          Construct a new fuzzy comparator with the default ignored words and minimum word length.
FuzzyCompare(Collection<String> ignored_words)
          Construct a new fuzzy comparator with the default minimum word length.
FuzzyCompare(Collection<String> ignored_words, int min_word_length)
          Construct a new fuzzy comparator.
FuzzyCompare(int min_word_length)
          Construct a new fuzzy comparator with the default ignored words.
FuzzyCompare(String lhs)
          Construct a new fuzzy comparator with the default ignored words and minimum word length.
FuzzyCompare(String lhs, Collection<String> ignored_words)
          Construct a new fuzzy comparator with the default minimum word length.
FuzzyCompare(String lhs, Collection<String> ignored_words, int min_word_length)
          Construct a new fuzzy comparator.
FuzzyCompare(String lhs, int min_word_length)
          Construct a new fuzzy comparator with the default ignored words.
 
Method Summary
 void addIgnoredWords(Collection<String> ignored_words)
          Add the additional collection of words to the ignored words list.
 void addIgnoredWords(String... ignored_words)
          Add the additional collection of words to the ignored words list.
 double difference(String rhs)
          Perform the difference comparison against the specified string.
 double difference(String lhs, String rhs)
          Perform the difference comparison against the two strings.
 String[] getIgnoredWords()
          Returns the current set of ignored words.
 int getMinimumWordLength()
          The current minimum word length (default of 3).
 FuzzyLexicon makeLexicon(String str)
          Construct a lexicon of all the good words in the string.
 void setIgnoredWords(Collection<String> ignored_words)
          Set the ignored word set to the specified collection.
 void setLeft(String lhs)
          Set the left side to be compared.
 void setMinimumWordLength(int minimumWordLength)
          Set the minimum word length that will be used.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FuzzyCompare

public FuzzyCompare()
Construct a new fuzzy comparator with the default ignored words and minimum word length.

See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(Collection<String> ignored_words)
Construct a new fuzzy comparator with the default minimum word length.

Parameters:
ignored_words - The set of words to be ignored.
See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(Collection<String> ignored_words,
                    int min_word_length)
Construct a new fuzzy comparator.

Parameters:
ignored_words - The set of words to be ignored.
min_word_length - Ignore words that are smaller than this length.
See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(int min_word_length)
Construct a new fuzzy comparator with the default ignored words.

Parameters:
min_word_length - Ignore words that are smaller than this length.
See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(String lhs)
Construct a new fuzzy comparator with the default ignored words and minimum word length.

Parameters:
lhs - The left string to compare.
See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(String lhs,
                    Collection<String> ignored_words)
Construct a new fuzzy comparator with the default minimum word length.

Parameters:
lhs - The left string to compare.
ignored_words - The set of words to be ignored.
See Also:
getMinimumWordLength()

FuzzyCompare

public FuzzyCompare(String lhs,
                    Collection<String> ignored_words,
                    int min_word_length)
Construct a new fuzzy comparator.

Parameters:
lhs - The left string to compare.
ignored_words - The set of words to be ignored.
min_word_length - Ignore words that are smaller than this length.

FuzzyCompare

public FuzzyCompare(String lhs,
                    int min_word_length)
Construct a new fuzzy comparator with the default ignored words.

Parameters:
lhs - The left string to compare.
min_word_length - Ignore words that are smaller than this length.

Method Detail

addIgnoredWords

public void addIgnoredWords(Collection<String> ignored_words)
Add the additional collection of words to the ignored words list.

Parameters:
ignored_words - The words to add to the ignored words list.

addIgnoredWords

public void addIgnoredWords(String... ignored_words)
Add the additional collection of words to the ignored words list.

Parameters:
ignored_words - The words to add to the ignored words list.

difference

public double difference(String rhs)
Perform the difference comparison against the specified string. The left string must already have been set.

Parameters:
rhs - The string to compare against.
Returns:
The difference between the strings.
See Also:
difference(String, String), setLeft(String)

difference

public double difference(String lhs,
                         String rhs)
Perform the difference comparison between the two strings. The left string is remembered and can be used for future comparisons.

Parameters:
lhs - The left string to compare.
rhs - The right string to compare.
Returns:
The difference between the strings.
See Also:
difference(String), setLeft(String)
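
The relationship between setLeft(String) and the two difference overloads can be pictured with a small stand-in. This is only a sketch of the "remembered left side" pattern described above: the class RememberedLeftSketch is hypothetical, the comparison itself is passed in as a function rather than being the real fuzzy algorithm, and throwing IllegalStateException when no left string has been set is an assumption, since the documentation does not say how FuzzyCompare reacts in that case.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical stand-in showing how a remembered left side might behave. */
final class RememberedLeftSketch {
    // Stand-in for the fuzzy comparison; any (left, right) -> double function will do.
    private final ToDoubleBiFunction<String, String> metric;
    // The left string, kept between calls.
    private String left;

    RememberedLeftSketch(ToDoubleBiFunction<String, String> metric) {
        this.metric = metric;
    }

    void setLeft(String lhs) {
        this.left = lhs;
    }

    /** Compare the remembered left string against rhs. */
    double difference(String rhs) {
        if (left == null) {
            // Assumption: the real class's behavior here is not documented.
            throw new IllegalStateException("left string has not been set");
        }
        return metric.applyAsDouble(left, rhs);
    }

    /** Remember lhs, then compare it against rhs. */
    double difference(String lhs, String rhs) {
        setLeft(lhs);
        return difference(rhs);
    }
}
```

With this shape, one left string can be compared against many candidates without being passed again on every call, which is the usage the single-argument difference(String) appears to be designed for.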

getIgnoredWords

public String[] getIgnoredWords()
Returns the current set of ignored words.

Returns:
The current set of ignored words.

getMinimumWordLength

public int getMinimumWordLength()
The current minimum word length (default of 3).

Returns:
The current minimum word length.

makeLexicon

public FuzzyLexicon makeLexicon(String str)
Construct a lexicon of all the good words in the string.

Parameters:
str - The string to process.
Returns:
A new lexicon.

setIgnoredWords

public void setIgnoredWords(Collection<String> ignored_words)
Set the ignored word set to the specified collection.

Parameters:
ignored_words - The collection of words to use as the ignored word set.

setLeft

public void setLeft(String lhs)
Set the left side to be compared.

Parameters:
lhs - The string to be compared.
See Also:
difference(String), difference(String, String)

setMinimumWordLength

public void setMinimumWordLength(int minimumWordLength)
Set the minimum word length that will be used.

Parameters:
minimumWordLength - Ignore words that are smaller than this length.