Fuzzysearch Module Initial Release

The initial release of the fuzzy search engine module is finally here. I have implemented it here on my blog for anyone who would like to give it a quick try. To download the module, visit the official release page. I'd really like to get some test users so that I can gather feedback on the results and performance.

So far, I have only been able to test it on my blog, which has a limited amount of content. However, preliminary results show faster searches than search.module. I think this is mostly because it does not use temporary tables and does not call node_load on each of the results; instead, I chose to join the teaser and title from {node_revisions} and {node} respectively.
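The single-query idea can be sketched as follows. This is a minimal illustration using Python's sqlite3 rather than Drupal's database layer. The table and column names are simplified from the query quoted in the comments, and the sample rows and the `search` helper are hypothetical.

```python
import sqlite3

# In-memory stand-ins for the Drupal tables (simplified, hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, vid INTEGER, title TEXT);
    CREATE TABLE node_revisions (vid INTEGER PRIMARY KEY, teaser TEXT);
    CREATE TABLE search_fuzzy_index (sid INTEGER, trigram TEXT, score REAL);
""")
conn.executemany("INSERT INTO node VALUES (?, ?, ?)",
                 [(1, 10, "Fuzzy search"), (2, 20, "Pager notes")])
conn.executemany("INSERT INTO node_revisions VALUES (?, ?)",
                 [(10, "About trigram matching"), (20, "About paging")])
conn.executemany("INSERT INTO search_fuzzy_index VALUES (?, ?, ?)",
                 [(1, "fuz", 1.0), (1, "uzz", 1.0), (2, "pag", 1.0)])


def search(conn, trigrams):
    """Fetch scores, titles and teasers in one query, with no per-result load."""
    placeholders = ", ".join("?" for _ in trigrams)
    sql = f"""
        SELECT f.sid, SUM(f.score) AS total_score, n.title, nr.teaser
        FROM search_fuzzy_index f
        LEFT JOIN node n ON n.nid = f.sid
        LEFT JOIN node_revisions nr ON nr.vid = n.vid
        WHERE f.trigram IN ({placeholders})
        GROUP BY f.sid, n.title, nr.teaser
        ORDER BY total_score DESC
    """
    return conn.execute(sql, trigrams).fetchall()


results = search(conn, ["fuz", "uzz", "zzy"])
```

Everything needed to render a result row comes back from the one query, so the display code never has to load each node separately.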

Features include:

  • Misspellings and typos still provide relevant results.
  • External scoring factor hooks exposed so contrib modules can give administrators options for scoring.
  • Reindex function available to allow modules to specifically call a certain node for reindexing at next cron run.
  • Indexing of CCK textfield field types and taxonomy terms.
  • Implements hook_nodeapi's 'update index' op, so current modules integrating with search.module will work the same.
  • Improved search performance over search.module because there are no temporary tables created during search.
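To show why misspellings can still produce relevant results, here is a minimal sketch of trigram matching. It is not the module's code. The `ngrams` and `completeness` helpers and the sample word list are hypothetical, and the scoring (the share of query trigrams found in a candidate) is an assumption made for illustration.

```python
def ngrams(word, n=3):
    """Split a word into overlapping character n-grams."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def completeness(query, candidate, n=3):
    """Fraction of the query's n-grams that also occur in the candidate."""
    query_grams = ngrams(query, n)
    if not query_grams:
        return 0.0
    candidate_grams = set(ngrams(candidate, n))
    return sum(1 for g in query_grams if g in candidate_grams) / len(query_grams)


# Hypothetical indexed words and a misspelled query.
index = ["performance", "relevance", "permanent"]
ranked = sorted(index, key=lambda w: completeness("perfromance", w), reverse=True)
```

Even with two letters transposed, "perfromance" shares five of its nine trigrams with "performance", so that word ranks first.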

Installation provides a block that should display just like the normal search.module block. I've also included a few empty CSS selectors to help anyone trying to modify the block display.

Main items I would like some feedback on are:

  • Scoring and relevancy - Are the returned results what they should be?
  • Result display - Are the results being displayed in an attractive manner? Is information missing that is needed?
  • How well the module integrates with forum.module and other content-related modules. I already included hook calls to nodeapi's 'update index' so that any module currently returning information to the indexer will continue to do so. I have also included automatic indexing of CCK text fields.

The deadline for the final Summer of Code submission is nearing, but that doesn't mean development of future features will stop. Please share any ideas that you think could help this module provide a great search experience for all users.

There currently is no pager

There currently is no pager available on the search results. This should be taken care of by the next release.

Also, the currently available score modifier doesn't actually do anything, since it applies the same score to each node. I left it in as an example of what it would look like when one installs a module that provides such indexing score modifiers. I will be looking to release a few contrib modules for this in the near future.
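A small sketch of why a modifier that gives every node the same score has no effect on ranking. The post does not say how modifiers are combined with the base score; multiplying is an assumption here, and the `rank` helper, node ids and scores are hypothetical.

```python
def rank(base_scores, modifier):
    """Order node ids by base score times a per-node modifier (assumed scheme)."""
    adjusted = {nid: score * modifier(nid) for nid, score in base_scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)


base = {1: 4.0, 2: 9.0, 3: 6.5}          # hypothetical node id -> base score


def constant(nid):
    """Same factor for every node, like the bundled example modifier."""
    return 2.0


def boost_node_1(nid):
    """A modifier that actually distinguishes between nodes."""
    return 3.0 if nid == 1 else 1.0


unmodified = rank(base, lambda nid: 1.0)
with_constant = rank(base, constant)
with_boost = rank(base, boost_node_1)
```

The constant modifier leaves the order untouched, while a modifier that varies per node can move a result to the top.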

Excellent work!

First, thanks for this

First, thanks for this excellent module! I have been eagerly waiting for you to get the first release out :)

I tested fuzzysearch on the site I'm building at the moment. It has plenty of custom CCK node types with taxonomy terms, etc. So far, search works pretty well (using default settings) and the results are sensible. I'm planning to do more testing using different n-gram lengths, as at the moment I don't recall what the optimal n-gram length is for the language used on the site (Finnish).

It seems that search keys containing non-English characters such as ÄÅÄÖ do not work yet. I get the following error when searching with the key "äiti" (which means "mother" in Finnish). My local dev environment is MAMP with default settings (MySQL 5.0.19, MySQL connection collation: utf8_general_ci).

-----
user warning: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' query: SELECT sid, SUM(score) score, SUM(completeness) completeness, SUM(total) total, n.title, nr.teaser FROM search_fuzzy_index f LEFT JOIN node n ON (n.nid = f.sid) LEFT JOIN node_revisions nr ON (n.vid = nr.vid) WHERE trigram = 'äi' OR trigram = '?it' OR trigram = 'iti' GROUP BY sid ORDER BY score DESC in [path in dev server]includes/database.mysql.inc on line 172.

Matti, thanks for the follow

Matti, thanks for the follow-up. I'm excited to have someone testing this on a site that uses a language other than English. I've found that with English an n-gram length of 3 is pretty much the best you'll get for returning results when there are misspellings or typos, because going to 4 or 5 means the user has to type a longer run of correct characters for the word they are searching for.
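The trade-off can be checked with a short sketch. The helpers and the example words are hypothetical and not taken from the module.

```python
def ngrams(word, n):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(query, target, n):
    """N-grams of the query that also occur in the target word."""
    target_grams = set(ngrams(target, n))
    return [g for g in ngrams(query, n) if g in target_grams]


# One transposition ("fr" for "or") in an 11-letter word.
counts = {n: len(shared_ngrams("perfromance", "performance", n))
          for n in (3, 4, 5)}
# counts == {3: 5, 4: 3, 5: 1}
```

With a single typo, the misspelled word still shares five trigrams with the intended word, but only three 4-grams and one 5-gram, so longer n-grams leave much less evidence to match on.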

In terms of the collation mix, I'll be looking into this immediately and will post a patch/solution asap.

The international/utf8

The international/utf8 characters issue has been resolved and updated code is now posted in the development snapshot at http://drupal.org/project/fuzzysearch.
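The post does not say what the fix was. As a hedged illustration only: the three trigrams in the quoted query ('äi', '?it', 'iti') match what you get when "äiti" is split by UTF-8 bytes rather than by characters. The sketch below shows that difference in Python; the function names are hypothetical.

```python
def char_trigrams(text):
    """Trigrams taken over characters (code points)."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def byte_trigrams(text):
    """Trigrams taken over the raw UTF-8 bytes."""
    raw = text.encode("utf-8")
    return [raw[i:i + 3] for i in range(len(raw) - 2)]


word = "\u00e4iti"                 # "äiti"; ä is two bytes in UTF-8

by_char = char_trigrams(word)      # ['äit', 'iti']
by_byte = byte_trigrams(word)      # [b'\xc3\xa4i', b'\xa4it', b'iti']
```

The second byte-level trigram starts in the middle of "ä" and is not valid UTF-8 on its own, which is consistent with the '?it' shown in the error message.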

Works fine now, thanks! I'll

Works fine now, thanks! I'll keep on testing and post reports to the issue queue.

Testing fuzzy search on a

I'm testing fuzzy search on a site with over 40k nodes, and cron has trouble indexing the site: it takes ages, and in the end I get a blank white page instead of "cron has run". I'm afraid it times out during indexing.

Nice! I've bookmarked it

thanks for this

nice work .
