Search Stop Words

User snapcount
Date 11/29/2006 6:15 pm
Views 2828
snapcount
Please reference this bug:
http://www.plainblack.com/bugs/tracker/search-not-returning-all-results#HICVkpdho4c1FF58xzSsVA

MySQL's fulltext system has a list of built-in stopwords that are
automatically ignored in search queries.  This is not a problem, it's a
feature... a nice feature actually.  The problem is that there is no sane
way to figure out which words MySQL is excluding from its fulltext
queries.  The idea here is that we should be able to tell a user, "Hey,
we didn't use words x, y, z in your query" so they know.

So after lots and lots of research I determined that there is no way
to get this list of stopwords using any kind of MySQL query (someone
please prove me wrong).  You can find the list in the docs, but the
list can change from version to version.

Enough editorial:

Problem:  Need to determine programmatically what stop words are
being used.

Why:  So we can tell people when search terms are discarded.

1) Possible solution:

Create our own stopwords file and tell mysql to use it.  This would  
help us because we would know for sure which words were being used.

Drawbacks - The solution lives in the MySQL domain rather than the
WebGUI domain, meaning our search would require a specific MySQL
configuration to work.  The change would affect all databases using
fulltext indexing on the machine.  It also requires restarting MySQL
and rebuilding all of the fulltext indexes to take effect.  It's also
kind of kludgy.

2) Possible solution:

Tell mysql to not use stopwords and implement configurable stopwords  
using WebGUI.

Lots of Benefits here:

a) Stop words and settings could be configured on a site by site  
basis through the UI.
b) The stop words are independent of each site and more importantly  
of other databases that may not be for WebGUI.
c) Easy to implement... basically just running regexes on our  
keywords before the query is executed, also lots of CPAN modules out  
there for this.

Drawbacks:

a) Still requires turning MySQL's stopword settings off.  Disabling the
built-in stopwords requires custom config params in the my.cnf file, a
db server restart and an index rebuild.
b) might be slower... not sure.
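
To make solution 2 concrete, here is a minimal sketch of the application-side filtering, written in Python purely for illustration (WebGUI itself is Perl).  The stopword set and the function name are hypothetical; a real implementation would presumably load the list from per-site settings.

```python
import re

# Hypothetical per-site stopword list (an assumption for this sketch);
# in practice it would come from site settings, not be hard-coded.
SITE_STOPWORDS = {"a", "an", "and", "the", "to", "of", "or"}


def split_stopwords(query, stopwords=SITE_STOPWORDS):
    """Return (kept, discarded) search terms, preserving query order."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    kept, discarded = [], []
    for word in words:
        # Route each word to the list it belongs in.
        (discarded if word in stopwords else kept).append(word)
    return kept, discarded


kept, discarded = split_stopwords("The history of the WebGUI search")
if discarded:
    # This is the message we could show the user next to the results.
    print("Ignored words: " + ", ".join(sorted(set(discarded))))
```

Only the kept words would be passed on to the fulltext query, and the discarded list is exactly what we need in order to tell the user which terms were dropped.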

Also, I know this means little here but this is how other  
applications are handling this problem.  Another possible solution is  
to submit an RFE to mysql asking for the ability to query stopwords  
or query for stopword hits from the previous query.

What do you guys think?

-Roy




len
MySQL ignores stopwords not only in search queries, but also while
indexing. That's why you have to run a "REPAIR TABLE" after you change the
stopword list.

There's another problem with the default stopword list in MySQL: It's in
English :)
There are English stopwords that have a different meaning in another
language.

MySQL stopwords are system wide so you can't use an English stopword list
for database A and a French stopword list for database B.

I would vote for not using any stopword list at all.
Also, a search for "to be or not to be" will find no results yet this is a
meaningful phrase.

There's another thing to mention: MySQL, by default, is using a minimum
word length of 4 (ft_min_word_len in my.cnf). You might set that to
something lower.

Len

snapcount

On Nov 30, 2006, at 1:51 AM, <len@primaat.com> wrote:

> len wrote:
>
> I would vote for not using any stopword list at all.
> Also, a search for "to be or not to be" will find no results yet this is a
> meaningful phrase.
>

You raised some very good points.  I think stopwords are a useful
feature for sites with a lot of content, to keep their searches from
taxing the server to death.  By moving the stopwords feature into
WebGUI, all of your concerns are still addressed because one of the
options would be "Disable Stop Words".

> There's another thing to mention: MySQL, by default, is using a minimum
> word length of 4 (ft_min_word_len in my.cnf). You might set that to
> something lower.
>

Another good point.  Not sure if I got this across in my last message,
but this would also be configurable in WebGUI.  You could set a minimum
and a maximum word length (0 to turn a limit off entirely) to whatever
you deemed appropriate.
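
As a rough sketch of that min/max setting (Python for illustration only; the function and parameter names are made up for this example), with 0 meaning "this limit is off":

```python
def filter_by_length(words, min_len=0, max_len=0):
    """Split words into (kept, discarded) by length.

    A limit of 0 switches that limit off, so the defaults keep everything.
    """
    kept, discarded = [], []
    for word in words:
        too_short = min_len > 0 and len(word) < min_len
        too_long = max_len > 0 and len(word) > max_len
        (discarded if too_short or too_long else kept).append(word)
    return kept, discarded


# With a minimum length of 3, the two-letter words are dropped.
kept, discarded = filter_by_length(["to", "be", "search", "engine"], min_len=3)
```

As with stopwords, the discarded list can be reported back to the user.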

It sucks that in order to do this, we still have to have MySQL conf
settings for ft_stopword_file="", ft_min_word_len=1, etc. to keep
MySQL out of the picture.  The only way to avoid this is to not use
fulltext indexes, and the entire search system is written around them,
so that doesn't seem like a good solution.  There are other options out
there like DBIx::FullText which also make managing stopwords very
easy.  Keeping things in perspective, I don't think this is an issue
that warrants a complete rewrite of the search system, but requiring
MySQL settings doesn't seem unreasonable... besides, if they don't
turn it off we can detect it and have the interface in WebGUI
disabled, greyed out, whatever, saying "Hey, go turn stopwords off in
MySQL".

-Roy

susanb
On Nov 30, 2006, at 7:20 AM, <roy@plainblack.com> wrote:
>
> It sucks that in order to do this, we still have to have MySQL conf
> settings for ft_stopword_file="", ft_min_word_len=1, etc. to keep
> MySQL out of the picture.  The only way to avoid this is to not use
> fulltext indexes, and the entire search system is written around them,
> so that doesn't seem like a good solution.  There are other options out
> there like DBIx::FullText which also make managing stopwords very
> easy.  Keeping things in perspective, I don't think this is an issue
> that warrants a complete rewrite of the search system, but requiring
> MySQL settings doesn't seem unreasonable... besides, if they don't
> turn it off we can detect it and have the interface in WebGUI
> disabled, greyed out, whatever, saying "Hey, go turn stopwords off in
> MySQL".
>
I'm just a lurker, not a programmer, but for the WRE would it make
sense to add questions about minimum word length and perhaps the
stopword list to the other questions asked in setup, and then
configure my.cnf appropriately?  Is there any performance benefit to
MySQL being configured "correctly" vs. doing it within WebGUI?

-Susan



snapcount

On Nov 30, 2006, at 1:55 PM, <susan@cdl.edu> wrote:

> susanb wrote:
>
> I'm just a lurker, not a programmer, but for the WRE would it make
> sense to add questions about minimum word length and perhaps the
> stopword list to the other questions asked in setup, and then
> configure my.cnf appropriately?  Is there any performance benefit to
> MySQL being configured "correctly" vs. doing it within WebGUI?
>

It's nice to hear a fresh voice on the list =)  We could do that for
the WRE setup, but the drawback is that those settings are MySQL
server wide.  So you may have, for example, one site in English and one
in Dutch, or one site that would use the common stop words and another
that is about history.  As Len pointed out, "To be or not to be" would
be a valid phrase on a history site, but on all your other sites those
words should be ignored.

As far as speed goes, I don't know.  We'd have to do some benchmarking
to find out.  As I noted earlier, however, this would be optional
because we can run a query from WebGUI to tell if the stopword settings
are enabled on the MySQL server.  So if someone wanted to continue to
use MySQL for stopword processing they could, or they could disable
it and WebGUI's stopword handling would kick in.  I suspect WebGUI
could do it pretty quickly as Perl is very good at text processing.
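
A hedged sketch of that detection step, in Python for illustration.  It assumes the result of SHOW VARIABLES LIKE 'ft_%' has already been fetched into a dict, that an empty ft_stopword_file means MySQL stopword filtering is off, and that the default list is reported as "(built-in)"; both values should be checked against the MySQL version in use.  The function names and sample dicts are hypothetical.

```python
def mysql_stopwords_enabled(variables):
    """Decide from SHOW VARIABLES output whether MySQL applies stopwords.

    Assumption: an empty ft_stopword_file disables stopword filtering, and
    a missing entry means the server is still on its built-in defaults.
    """
    return variables.get("ft_stopword_file", "(built-in)") != ""


def stopword_handler(variables):
    """Pick which layer should handle stopwords for this server."""
    return "mysql" if mysql_stopwords_enabled(variables) else "webgui"


# Hypothetical sample values, standing in for real query results.
default_server = {"ft_stopword_file": "(built-in)", "ft_min_word_len": "4"}
tuned_server = {"ft_stopword_file": "", "ft_min_word_len": "1"}

print(stopword_handler(default_server))
print(stopword_handler(tuned_server))
```

When the handler comes back as "mysql", the WebGUI stopword settings could be greyed out with a note explaining why.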

-Roy


snapcount

Ok... this bug has obviously transformed into, or currently requires, an RFE to fix.  I posted one here, so if this is a feature you're interested in seeing implemented, transfer some karma in so we can work on it.

Also, if there are any devs interested in working on this, let me know... we just need JT to approve the RFE first.



DBell

I'll post this here, as it is more discussion:

It could also be possible to bypass the MySQL fulltext-search system entirely and do indexing by keyword, though that would be a lot more work.

It would take up more disk space, but it might be faster to process. Since the searching itself is already done, you'd just have to select the keywords. 
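
For discussion, a minimal sketch of what indexing by keyword could look like, in Python and held in memory only.  A real version would presumably store the keyword-to-asset mapping in a database table; all names and sample documents here are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_keyword_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "To be or not to be",
    2: "WebGUI search settings",
    3: "Not a search engine",
}
index = build_keyword_index(docs)
```

Because every word is indexed, short words and would-be stopwords stay searchable, and any filtering becomes a decision made at query time.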

 


