Fuzzysearch Major Update


Over the past week and a half I have made a lot of progress on my project. These are the main accomplishments:

Completed Items

Scoring Hooks

  • Scoring hooks give other module developers a way to insert a score multiplier at indexing time. See the Scoring Hook API.
  • Scoring hooks modify a node's score during indexing and affect the score of every word indexed for that node. Doing this at index time is the most efficient approach because it adds no extra work at search time. It also suits this new search module, which can tag nodes for re-indexing, whereas search.module only re-indexes a node when its updated timestamp has changed.
  • Site administrators control how much of an effect each score modifier has through the control panel (screenshot below).

fuzzysearch_admin.png
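
To make the scoring hook idea concrete, here is a minimal sketch in Python (not the module's PHP). The function names, the hook registry, and the way the admin weight blends with the hook's multiplier are all assumptions made for illustration; the actual Scoring Hook API may look quite different.

```python
from typing import Callable, Dict

# Hypothetical hook type: takes a node (a plain dict here), returns a multiplier.
ScoreHook = Callable[[dict], float]


def sticky_hook(node: dict) -> float:
    """Illustrative hook: double the score of sticky nodes."""
    return 2.0 if node.get("sticky") else 1.0


def apply_score_hooks(
    node: dict,
    base_scores: Dict[str, float],
    hooks: Dict[str, ScoreHook],
    admin_weights: Dict[str, float],
) -> Dict[str, float]:
    """Scale every word score of one node by the combined hook multiplier.

    Assumption: an admin weight of 0 switches a hook off and a weight of 1
    applies its multiplier in full; values in between blend linearly.
    """
    multiplier = 1.0
    for name, hook in hooks.items():
        weight = admin_weights.get(name, 1.0)
        multiplier *= 1.0 + weight * (hook(node) - 1.0)
    # The multiplier is applied once, at index time, to each word of the node.
    return {word: score * multiplier for word, score in base_scores.items()}


example_scores = apply_score_hooks(
    {"sticky": True},
    {"drupal": 4.0, "search": 2.0},
    {"sticky": sticky_hook},
    {"sticky": 0.5},
)
# example_scores == {"drupal": 6.0, "search": 3.0}
```

Because the multiplier is folded into the stored word scores, the search query itself does not need to know that any hooks exist.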

Indexing Improvements

  • HTML tag scoring, like that used in search.module, but with a different regular expression. In search.module the regex splits the content at each tag, which needs a fix to catch unclosed tags. The approach I have taken pulls out the text inside a tag, and only if that tag has a definite beginning and end.
  • More efficient indexing! Each word is now indexed only once per node. This takes a bit more processing during the indexing phase, because I have to loop over the words again and collect them into an array before indexing. The benefits are a smaller search index and more accurate results for the completeness metric, since the same word no longer contributes to completeness for every occurrence within a node.
  • Nodeapi's $op = 'update_index' has been implemented in the indexing phase, so modules that currently send information to the indexer will work properly.
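
The first two items can be sketched together. The Python below is an illustration only, not the module's PHP: the tag weights, the regular expression, and the choice to sum a word's scores into a single entry are assumptions. It shows a pattern that only matches a tag with both an opening and a matching closing, and a collection step that leaves one index entry per word.

```python
import re
from collections import defaultdict

# Hypothetical tag weights; the real values would come from the module settings.
TAG_WEIGHTS = {"h1": 5.0, "h2": 4.0, "strong": 2.0, "em": 2.0}

# Match <tag ...>text</tag> only when the same weighted tag opens and closes.
# An unclosed tag never matches, so its text keeps the base score.
TAG_RE = re.compile(
    r"<(" + "|".join(map(re.escape, TAG_WEIGHTS)) + r")\b[^>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
STRIP_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"\w+")


def score_words(html: str) -> dict:
    """Return one score per distinct word (one index row per word)."""
    scores = defaultdict(float)
    # Base score: 1 per occurrence in the tag-stripped text.
    for word in WORD_RE.findall(STRIP_RE.sub(" ", html).lower()):
        scores[word] += 1.0
    # Bonus for words inside a well-formed weighted tag. Weighted tags nested
    # inside another weighted tag are not scored separately in this sketch.
    for match in TAG_RE.finditer(html):
        bonus = TAG_WEIGHTS[match.group(1).lower()] - 1.0
        inner = STRIP_RE.sub(" ", match.group(2)).lower()
        for word in WORD_RE.findall(inner):
            scores[word] += bonus
    return dict(scores)


sample = "<h1>Fuzzy search</h1><p>Search is <strong>fuzzy</strong> and <em>fast"
word_scores = score_words(sample)
# "fuzzy" and "search" each appear twice but produce a single entry;
# "fast" sits in an unclosed <em> and only receives the base score.
```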

Administrable N-Gram Length

  • Variable-length qgrams: the administrator can easily change the length of the qgrams. I've allowed for 3, 4, or 5 as of now.
  • Indexing and searching now also work for words shorter than the qgram length. This matters when the admin sets the length to 5, because it ensures that shorter words are still indexed and searchable.
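
As a small illustration of variable-length qgrams, here is a Python sketch (not the module's PHP). The function name is hypothetical, and indexing a short word as a single whole-word gram is an assumption about how short words could be handled; padding the word would be another option.

```python
ALLOWED_LENGTHS = (3, 4, 5)  # the lengths mentioned above


def qgrams(word: str, length: int = 3) -> list:
    """Split a word into overlapping qgrams of the configured length."""
    if length not in ALLOWED_LENGTHS:
        raise ValueError("qgram length must be 3, 4 or 5")
    word = word.lower()
    if len(word) <= length:
        # Assumption: a word no longer than the qgram length is kept whole,
        # so it still gets an index entry and stays searchable.
        return [word]
    return [word[i:i + length] for i in range(len(word) - length + 1)]


# qgrams("search", 3) -> ["sea", "ear", "arc", "rch"]
# qgrams("drupal", 5) -> ["drupa", "rupal"]
# qgrams("cat", 5)    -> ["cat"]
```

The same function would be applied to both the indexed text and the search terms, so that a query is compared gram by gram against the index.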

To Do List

This Week

  • Finalize the re-index API functions that will allow other modules to easily tag a node as needing to be re-indexed.
  • Work on the front-end appearance of the search engine. The results need to be displayed with teasers below them. I currently have a theme function for outputting the search form, and I'd like to do the same for the results page. I also want to make it easy for someone to modify the search form and print it on any part of their site, which includes making a block for the search form.
  • Enable a stop words function. The site admin will be able to turn it on or off and choose the words they do not want indexed.
  • Allow people to search for exact phrases and use OR and AND to filter results (just like the current search does).

By Next Week

  • Get volunteers running tests on the results being returned and get reports on performance.
  • Use the results of that testing to fine-tune the search query so that the best matches are returned.

Blake, glad to see the work on the SoC project is going well! When the time comes to test it out, let me know and I'll see if we can have Ubercart.org be a guinea pig for you. :)

Wow, that'd be great. It'll likely be ready for testing in the next few weeks. I'd really like to implement a score modifier for Ubercart so that products that sell more are given more weight in the search results. Any other metrics for shopping carts would be great to provide as well; I would love to get some other ideas.

Loving what you've done here.
