snapcount
Date: 11/29/2006 6:15 pm · Subject: Search Stop Words · Rating: 2
Please reference this bug:
http://www.plainblack.com/bugs/tracker/search-not-returning-all-results#HICVkpdho4c1FF58xzSsVA
MySQL's fulltext system has a list of built-in stopwords that are
automatically dropped from search queries. This is not a problem,
it's a feature... a nice feature actually. The problem is that there
is no sane way to figure out which words MySQL is excluding from its
fulltext queries. The idea here is that we should be able to tell a
user, "Hey, we didn't use words x, y, z in your query" so they know.
So after lots and lots of research I determined that there is no way
to get this list of stopwords using any kind of MySQL query (someone
please prove me wrong). You can find the list in the docs, but the
list can change from version to version.
Enough editorial:
Problem: Need to determine programmatically which stop words are
being used.
Why: So we can tell people when search terms are discarded.
1) Possible solution:
Create our own stopwords file and tell MySQL to use it. This would
help us because we would know for sure which words were being used.
Drawbacks - The solution lives outside the WebGUI domain and in the
MySQL domain, meaning our search would require a specific MySQL
configuration to work. It would affect every database using fulltext
indexing on the machine. Also, the change requires restarting MySQL
and rebuilding all of the fulltext indexes to take effect. It's also
kind of kludgy.
2) Possible solution:
Tell MySQL not to use stopwords and implement configurable stopwords
in WebGUI.
Lots of benefits here:
a) Stop words and settings could be configured on a site-by-site
basis through the UI.
b) Each site's stop words are independent of every other site's and,
more importantly, of other databases that may not be for WebGUI.
c) Easy to implement... basically just running regexes on our
keywords before the query is executed. There are also lots of CPAN
modules out there for this.
Drawbacks:
a) Still requires turning MySQL's stopword settings off. That means
custom config params in the my.cnf file, plus a db server restart and
an index rebuild, to disable the built-in stopwords.
b) Might be slower... not sure.
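To make option 2 concrete, here is a minimal sketch of filtering stop words before the query is built and reporting what was dropped. It is written in Python purely for illustration (WebGUI itself is Perl), and the names `STOP_WORDS`, `split_stop_words`, and `notice` are made up for this example; a real list would come from per-site settings.

```python
import re

# Hypothetical per-site stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def split_stop_words(query, stop_words=STOP_WORDS):
    """Return (kept, dropped) search terms, preserving the user's order."""
    kept, dropped = [], []
    for term in re.findall(r"\w+", query.lower()):
        if term in stop_words:
            dropped.append(term)
        else:
            kept.append(term)
    return kept, dropped


def notice(dropped):
    """Build the message shown to the user, or None if nothing was dropped."""
    if not dropped:
        return None
    unique = list(dict.fromkeys(dropped))  # de-duplicate, keep first-seen order
    return "These words were not used in your search: " + ", ".join(unique)
```

Only the `kept` terms would be passed on to the fulltext query; the `dropped` list is what lets the UI say "we didn't use words x, y, z".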
Also, I know this means little here, but this is how other
applications are handling this problem. Another possible solution is
to submit an RFE to MySQL asking for the ability to query the
stopword list or to query for stopword hits from the previous query.
What do you guys think?
-Roy
len
Date: 11/30/2006 12:50 am · Subject: Re: Search Stop Words · Rating: 4
MySQL ignores stopwords not only in search queries but also while
indexing. That's why you have to run a "REPAIR TABLE" after you change the
stopword list.
There's another problem with the default stopword list in MySQL: It's in
English :)
There are English stopwords that have a different meaning in another
language.
MySQL stopwords are system wide so you can't use an English stopword list
for database A and a French stopword list for database B.
I would vote for not using any stopword list at all.
Also, a search for "to be or not to be" will find no results yet this is a
meaningful phrase.
There's another thing to mention: MySQL, by default, is using a minimum
word length of 4 (ft_min_word_len in my.cnf). You might set that to
something lower.
Len
snapcount
Date: 11/30/2006 9:20 am · Subject: Re: Search Stop Words · Rating: 10
On Nov 30, 2006, at 1:51 AM, <len@primaat.com> wrote:
> len wrote:
>
> I would vote for not using any stopword list at all.
> Also, a search for "to be or not to be" will find no results yet
> this is a
> meaningful phrase.
>
You raised some very good points. I think stopwords are a useful
feature for sites with a lot of content to keep their searches from
taxing the server to death. By moving the stopwords feature into
WebGUI, all of your concerns are still addressed because one of the
options would be "Disable Stop Words".
> There's another thing to mention: MySQL, by default, is using a
> minimum
> word length of 4 (ft_min_word_len in my.cnf). You might set that to
> something lower.
>
Another good point. I'm not sure I got this across in my last message,
but this would also be configurable in WebGUI. You could set a minimum
and a maximum word length (0 to turn a limit off entirely) or any
length you deemed appropriate.
It sucks that in order to do this, we still have to have MySQL conf
settings such as ft_stopword_file="" and ft_min_word_len=1 to keep
MySQL out of the picture. The only way to avoid this is to not use
fulltext indexes, and the entire search system is written around them,
so that doesn't seem like a good solution. There are other options out
there like DBIx::FullText, which also makes managing stopwords very
easy. Keeping things in perspective, I don't think this is an issue
that warrants a complete rewrite of the search system, but requiring
MySQL settings doesn't seem unreasonable... besides, if they don't
turn it off we can detect that and have the interface in WebGUI
disabled, greyed out, whatever, saying "Hey, go turn stopwords off in
MySQL".
-Roy
susanb
Date: 11/30/2006 12:55 pm · Subject: Re: Search Stop Words · Rating: 5
On Nov 30, 2006, at 7:20 AM, <roy@plainblack.com> wrote:
>
> It sucks that in order to do this, we still have to have mysql conf
> settings such as ft_stopword_file="" and ft_min_word_len=1 to keep
> mysql out of the picture. The only way to avoid this is to not use
> fulltext indexes and the entire search system is written around that
> so it doesn't seem like a good solution. There are other options out
> there like DBIx::FullText which also makes managing stopwords very
> easy. Keeping things in perspective, I don't think this is an issue
> that warrants a complete re-write of the search system, but requiring
> mysql settings doesn't seem un-reasonable... besides, if they don't
> turn it off we can detect it and have the interface in WebGUI
> disabled, greyed out, whatever saying "Hey, go turn stopwords off in
> MySQL".
>
I'm just a lurker, not a programmer, but for the WRE would it make
sense to add questions about minimum word length and perhaps the
stopword list to the other questions asked in setup, and then
configure my.cnf appropriately? Is there any performance benefit to
MySQL being configured "correctly" vs. doing it within WebGUI?
-Susan
snapcount
Date: 11/30/2006 1:15 pm · Subject: Re: Search Stop Words · Rating: -1
On Nov 30, 2006, at 1:55 PM, <susan@cdl.edu> wrote:
> susanb wrote:
>
> I'm just a lurker, not a programmer, but for the WRE would it make
> sense to add questions about minimum word length and perhaps the
> stopword list to the other questions asked in setup, and then
> configure my.cnf appropriately? Is there any performance benefit to
> MySQL being configured "correctly" vs. doing it within WebGUI?
>
It's nice to hear a fresh voice on the list =) We could do that for
the WRE setup, but the drawback is that those settings are MySQL
server wide. So you may have, for example, one site in English and
one in Dutch, or one site that would use the common stop words and
another that may be about history. As Len pointed out, "To be or not
to be" would be a valid phrase on a history site, but on all your
other sites those words should be ignored.
As far as speed, I don't know. We'd have to do some benchmarking to
find out. As I noted earlier, however, this would be optional because
WebGUI can run a query to tell whether the stopword settings are
enabled on the MySQL server. So if someone wanted to continue to use
MySQL for stopword processing they could, or they could disable it
and WebGUI's stopword handling would kick in. I suspect WebGUI could
do it pretty quickly, as Perl is very good at text processing.
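A sketch of that detection step, in Python for illustration only. It assumes the caller has already run `SHOW VARIABLES LIKE 'ft_%'` and passes in the resulting (name, value) rows, and it assumes an empty `ft_stopword_file` value is how a server with stopwords turned off reports itself; the function names are hypothetical.

```python
# SQL the caller is assumed to run; rows come back as (Variable_name, Value).
FT_VARIABLES_SQL = "SHOW VARIABLES LIKE 'ft_%'"


def mysql_stop_words_disabled(rows):
    """True when ft_stopword_file is present and set to an empty string."""
    variables = dict(rows)
    value = variables.get("ft_stopword_file")
    return value is not None and value.strip() == ""


def choose_stop_word_handler(rows):
    """Pick who filters stop words: WebGUI only if MySQL has them off."""
    return "webgui" if mysql_stop_words_disabled(rows) else "mysql"
```

If the handler comes back as "mysql", the stop word settings in the UI could be greyed out with a note explaining which server setting to change.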
-Roy
snapcount
Date: 12/1/2006 2:40 pm · Subject: Re: Search Stop Words · Rating: 7
Ok... this bug has obviously transformed into, or currently requires, an RFE to fix. I posted one here, so if this is a feature you're interested in seeing implemented, transfer some karma in so we can work on it. Also, if there are any devs interested in working on this, let me know... we just need JT to approve the RFE first.
DBell
Date: 12/1/2006 2:54 pm · Subject: Re: Search Stop Words · Rating: 1
I'll post this here, as it is more discussion: it could also be
possible to bypass the MySQL fulltext-search system entirely and do
our own indexing by keyword, though that would be a lot more work. It
would take up more disk space, but it might be faster to process.
Since the searching itself is already done, you'd just have to select
the keywords.
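For anyone unfamiliar with keyword indexing, here is a toy in-memory version in Python. It is only meant to show the shape of the idea (a map from keyword to the documents containing it); a real implementation would live in database tables, and the function names are invented for this sketch.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because nothing is filtered at index time here, stop word handling becomes purely a query-time decision, at the cost of a larger index.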