Part 1 of my project (implementing synonym matching in the search index) is nearly complete; I am waiting for the patches to be accepted into core for Drupal 6. In addition to synonym matching, I also submitted a patch to index usernames with the nodes, as requested in the Search group on drupal.org. The patches can be reviewed here; all comments are welcome.
http://drupal.org/node/155262 - Taxonomy synonym search indexing
http://drupal.org/node/155254 - Username search indexing
For part 2 of my project I am to implement a fuzzy search engine in Drupal.
I would like to produce a module that implements n-gram based fuzzy search capabilities. [For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth.] One of the main reasons I would like to implement this specific type of fuzzy algorithm is its language independence. One of the major downsides of this implementation is that the size of the search index increases to
SUM [length(word(i)) - length(n_gram) + 1]
instead of
SUM [word(i)]
Also, the current index only has a score based on the text's place in an assortment of tags; an additional column will be needed to score each trigram based on the size of the full word. Without it, larger words would inherently get a larger score in the results. Thus a normalization factor needs to be added to the scores of the results, making an exact word match score 1, and so forth. This can be done with the following simple equation:
trigram score = 3/length(word)
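To make the trigram generation and the per-trigram score concrete, here is a minimal sketch in Python (not the module's actual code, which would be PHP inside Drupal). The function names `ngrams`, `trigram_score` and `index_entries` are made up for illustration.

```python
def ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def trigram_score(word, n=3):
    """Per-trigram weight from the equation above: n / length(word)."""
    return n / len(word)


def index_entries(word, n=3):
    """Pair each n-gram of a word with its normalized score.

    Words shorter than n produce no n-grams, so they yield no entries.
    """
    if len(word) < n:
        return []
    score = trigram_score(word, n)
    return [(gram, score) for gram in ngrams(word, n)]


# "good morning" has 12 characters, so it yields 12 - 3 + 1 = 10 trigrams.
grams = ngrams("good morning")
entries = index_entries("morning")
```

Here `grams` starts with "goo", "ood", "od ", "d m", " mo", "mor", matching the example above, and `entries` holds five trigrams, each weighted 3/7. One thing worth checking while tuning: because the trigrams overlap, a word of length L produces L - 2 of them, so the weights of a fully matched word sum to 3(L - 2)/L rather than exactly 1. A weight of 1/(L - 2) would be one way to make an exact match sum to exactly 1.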
Using SQL we can sum the trigram scores over results with the same nid (grouping by nid). The benefit of doing so is that exact matches return the highest scores, while results in which a spelling mistake has occurred still return a fairly high score, just not as high as one that was spelled correctly. This helps in instances where changing a single character results in a completely different word or meaning.
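The per-nid summing could look roughly like the sketch below. It uses Python's built-in sqlite3 module only so that the example runs on its own; the table name `fuzzy_index` and its columns are assumptions for illustration, not the schema the module will actually use.

```python
import sqlite3


def trigrams(word):
    """Overlapping 3-character substrings of word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_index(conn, nodes):
    """nodes maps nid -> list of words; store one row per trigram."""
    conn.execute(
        "CREATE TABLE fuzzy_index (nid INTEGER, ngram TEXT, score REAL)"
    )
    for nid, words in nodes.items():
        for word in words:
            # Each trigram carries the normalized score 3 / length(word).
            rows = [(nid, gram, 3 / len(word)) for gram in trigrams(word)]
            conn.executemany("INSERT INTO fuzzy_index VALUES (?, ?, ?)", rows)


def search(conn, term):
    """Sum the scores of matching trigrams per nid, best match first."""
    grams = sorted(set(trigrams(term)))
    if not grams:
        return []
    placeholders = ", ".join("?" for _ in grams)
    sql = (
        "SELECT nid, SUM(score) AS total FROM fuzzy_index "
        f"WHERE ngram IN ({placeholders}) "
        "GROUP BY nid ORDER BY total DESC"
    )
    return conn.execute(sql, grams).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, {1: ["morning"], 2: ["mourning"]})
results = search(conn, "morning")
```

Searching for "morning" ranks node 1 (which contains "morning") above node 2 (which contains "mourning"): all five query trigrams match in node 1, while only "rni", "nin" and "ing" match in node 2. The near miss still gets a score, just a lower one.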
I will follow up more this week as my work progresses, but my initial plans call for a separate search index so as not to interfere with the current search index.
One last thing: I'll be needing volunteers with decent-sized sites to test my module, so if you are interested please send me an email so that I can start gathering a group of beta testers to work with on tuning my algorithm to return the best results.






