Drupal Search Enhancement Update


Part 1 of my project (implementing synonym matching in the search index) is nearly complete; I am waiting for the patches to be accepted into core for Drupal 6. In addition to synonym matching, I also submitted a patch to index usernames with the nodes, as requested in the Search group on drupal.org. The patches can be reviewed here; all comments are welcome.

http://drupal.org/node/155262 - Taxonomy synonym search indexing
http://drupal.org/node/155254 - Username search indexing

For part 2 of my project, I am to implement a fuzzy search engine in Drupal.

I would like to produce a module that implements n-gram based fuzzy search. [For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth.] One of the main reasons I would like to implement this specific type of fuzzy algorithm is its language independence. One of its major downsides is that the size of the search index increases to:
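To make the bracketed example concrete, here is a minimal sketch of n-gram generation in Python (chosen only for illustration; the module itself would be PHP). The helper name `ngrams` is hypothetical and not part of any Drupal API.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Return every overlapping n-character window of text, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("good morning"))
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', 'orn', 'rni', 'nin', 'ing']
```

Nothing in the function depends on the alphabet or on word boundaries, which is where the language independence comes from.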

SUM [length(word(i)) - length(n_gram) + 1]

instead of
SUM [word(i)]
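
As a rough worked example of that growth, the sketch below counts index entries under both schemes. It reads the second sum as one entry per word, which is my interpretation; `index_entries` is a hypothetical helper, and words shorter than n are assumed to contribute no entries.

```python
def index_entries(words: list[str], n: int = 3) -> int:
    """Count n-gram index entries: sum of length(word) - n + 1 over all words."""
    # Assumption: a word shorter than n produces no n-grams, so clamp at 0.
    return sum(max(len(word) - n + 1, 0) for word in words)


words = ["good", "morning"]
print(len(words))            # one entry per word: 2
print(index_entries(words))  # trigram entries: (4 - 3 + 1) + (7 - 3 + 1) = 7
```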

Also, the current index only has a score based on the text's place in an assortment of tags. An additional column will be needed to score each trigram relative to the length of its full word; without it, longer words would inherently receive larger scores in the results. A normalization factor therefore needs to be applied to the scores, with the aim that an exact word match scores 1, and so forth. This can be done with the following simple equation:

trigram score = 3/length(word)
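
A minimal sketch of that weighting, assuming one index row per trigram occurrence; `trigram_scores` is a hypothetical name used only for illustration.

```python
def trigram_scores(word: str, n: int = 3) -> list[tuple[str, float]]:
    """Pair each n-gram of word with the weight n / length(word)."""
    if len(word) < n:
        return []  # assumption: words shorter than n are not indexed
    weight = n / len(word)
    return [(word[i:i + n], weight) for i in range(len(word) - n + 1)]


rows = trigram_scores("morning")
for gram, score in rows:
    print(gram, round(score, 3))
print(round(sum(score for _, score in rows), 3))  # total for an exact match
```

One thing this makes visible: a word of length L has L - 2 trigrams, so an exact match sums to 3(L - 2)/L (15/7, about 2.14, for "morning") rather than exactly 1. A per-trigram weight of 1/(L - 2) would sum to exactly 1; which normalization gives better rankings is something testing would have to settle.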

Using SQL, we can sum the trigram scores over rows that share the same nid (a GROUP BY on nid). The benefit is that exact matches return the highest scores, while results containing a spelling mistake still return a fairly high score, just not as high as a correctly spelled one. This helps in cases where changing a single character produces a completely different word or meaning.
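
The query could look roughly like the sketch below, which uses Python's built-in sqlite3 module so it can be run on its own. The table `search_trigram` and its columns are hypothetical, not the schema of Drupal's search index, and the sample nodes are made up for the demonstration.

```python
import sqlite3


def trigrams(word: str, n: int = 3) -> list[str]:
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(docs: dict[int, str]) -> sqlite3.Connection:
    """Index each word's trigrams with score = 3 / length(word)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE search_trigram (nid INTEGER, trigram TEXT, score REAL)"
    )
    for nid, text in docs.items():
        for word in text.lower().split():
            for gram in trigrams(word):
                conn.execute(
                    "INSERT INTO search_trigram VALUES (?, ?, ?)",
                    (nid, gram, 3 / len(word)),
                )
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[tuple[int, float]]:
    """Sum the scores of matching trigrams per node, best match first."""
    grams = trigrams(query.lower())
    if not grams:
        return []
    placeholders = ",".join("?" * len(grams))
    sql = (
        "SELECT nid, SUM(score) AS total FROM search_trigram "
        f"WHERE trigram IN ({placeholders}) "
        "GROUP BY nid ORDER BY total DESC"
    )
    return conn.execute(sql, grams).fetchall()


conn = build_index({1: "good morning", 2: "good mourning", 3: "evening news"})
print(search(conn, "morning"))  # node 1 ranks first, then 2, then 3
print(search(conn, "mornng"))   # the misspelling still finds node 1
```

In this toy data the exact match on node 1 outranks "mourning" and "evening", which share only a few trigrams with the query, and the misspelled query still reaches node 1 through the trigrams "mor" and "orn".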

I will follow up more this week as my work progresses, but my initial plans call for a separate search index so as not to interfere with the current search index.

One last thing: I'll be needing volunteers with decent-sized sites to test my module, so if you are interested, please send me an email. I would like to start gathering a group of beta testers to work with as I tune my algorithm to return the best results.

Nice! I've bookmarked it

this is a good thing

nice posting.
