
[EP-tech] Plural words in search results



Hi John,

Thanks for the feedback... I think I was involved in the discussions years ago about the simple search etc...

This particular issue appears to be more subtle: the indexer removes the trailing 's' from each keyword, unless the keyword is all caps.

Keyword     indexed term
platypus    platypu
Platypus    platypu
PLATYPUS    platypus

The trailing 's' is also removed from the search term. But if a keyword was entered in all caps, its 's' was not removed at index time, so the search fails for that item (the search term has lost its 's', the indexed term has not).
Search Expression       'platypus'
Search performed using  'platypu'

Keyword     indexed term      Match
platypus    platypu           Yes
Platypus    platypu           Yes
PLATYPUS    platypus          No
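
To make the mismatch concrete, here is a minimal Python sketch that models the behaviour observed in the tables above. It is not EPrints code (EPrints is Perl), and the function name `observed_term` is made up for illustration; it only reproduces the rule as I currently understand it: all-caps words keep their 's', everything else loses it.

```python
def observed_term(word: str) -> str:
    """Model of the observed behaviour only, NOT the real EPrints routine.

    Assumption: all-caps words are lower-cased but keep a trailing 's';
    any other word is lower-cased and loses one trailing 's'.
    """
    if word.isupper():
        return word.lower()
    lowered = word.lower()
    return lowered[:-1] if lowered.endswith("s") else lowered


keywords = ["platypus", "Platypus", "PLATYPUS"]

# The search expression 'platypus' is reduced to 'platypu' before matching.
query = observed_term("platypus")

# A record matches only if its indexed term equals the reduced search term.
matches = {kw: observed_term(kw) == query for kw in keywords}

for kw in keywords:
    print(f"{kw:<10} {observed_term(kw):<10} {'Yes' if matches[kw] else 'No'}")
```

Under this model the all-caps keyword is the only one whose indexed term differs from the reduced search term, which agrees with the Match column above.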


I will dig into it, and let you know if there is more patching required ;)

Cheers

Matt.


From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of John Salter
Sent: Thursday, 1 June 2017 7:16 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: Re: [EP-tech] Plural words in search results

Hi Matt,
I've looked into a similar issue in the past - and I think it was discussed on the tech list a few years ago.
I had added a fix (we're 3.3.10 too) - which recently was discovered to break things in a more subtle way.

If I remember the full story, it goes something like this:
The 'simple' search field is broken in vanilla EPrints 3.3.10 - as it doesn't strip out short-words.
The initial fix for this was to run the search terms through the $c->{extract_words} function (in cfg.d/indexer.pl).
This seemed to resolve the issue (we'd been running it like this for a few years), but we discovered that for a search field looking at multiple metafield types (e.g. a text field and a name field), if the search term ended in -ss it wouldn't find anything.

My current fix is: https://gist.github.com/jesusbagpuss/e096430c825d34a2ef1de671e8a7dfda
Both are 'patch' files (overwrite methods in the core EPrints modules - we try to keep these things separated - but you could just take the methods and edit the files they're patching directly).

There are two files - one resolves an issue with apostrophes in names (which may or may not affect you).

The issue you report is slightly different to the one we found - but I think the cause might be very similar - the stripping of a trailing 's' is applied during indexing, but the same is not applied when searching.
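
As a rough illustration of the general principle (index-time and search-time normalisation need to agree), here is a small Python sketch. It is not taken from the gist or from EPrints itself; `normalise`, `build_index` and `search` are hypothetical names, and folding case before stripping the 's' is just one possible design choice. It would also strip the 's' from all-caps words, which may not be what you want if the all-caps exemption is there to protect acronyms.

```python
def normalise(word: str) -> str:
    """Hypothetical shared normaliser: fold case first, then strip one trailing 's'."""
    folded = word.casefold()
    return folded[:-1] if folded.endswith("s") else folded


def build_index(records: dict[int, list[str]]) -> dict[int, set[str]]:
    """Index every keyword with the same normaliser the search will use."""
    return {eprintid: {normalise(k) for k in kws} for eprintid, kws in records.items()}


def search(index: dict[int, set[str]], term: str) -> list[int]:
    """Return the ids of records whose indexed terms contain the normalised term."""
    wanted = normalise(term)
    return sorted(eprintid for eprintid, terms in index.items() if wanted in terms)


# Illustrative records only: three eprints differing in keyword capitalisation.
records = {
    29533: ["ornithorhynchus", "platypus"],
    29534: ["Ornithorhynchus", "Platypus"],
    29535: ["ORNITHORHYNCHUS", "PLATYPUS"],
}
index = build_index(records)
print(search(index, "platypus"))
print(search(index, "PLATYPUS"))
```

Because both sides go through the same function, the capitalisation of the stored keyword or of the search term no longer affects whether a record is found.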

Hope that gets you somewhere - some of this stuff is fairly recent in my mind (fixing the fix took a bit of tracing through the modules) - there may be more useful stuff I have in my head!

Cheers,
John


From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Matthew Brady
Sent: 01 June 2017 06:09
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Plural words in search results

Hi All,

One of our users came across a problem when performing some keyword searches, and assumed it was a case problem, since the all-uppercase words in their testing weren't being returned in the result set.

After testing, I have a preliminary diagnosis (we are running 3.3.10, if it makes a difference).

It appears the index process is removing the 's' off the end of the word (unless the word is all caps).
When performing a search, the system removes the 's' from the search term, and performs the search... in our case this returns 2 of 3 test records.

When I took the last two letters off each eprint's keywords, and then performed a search, it returned all three records in the results.

+----------+----------+---------------+--------------------------+
|<-   details from eprint__rindex   ->|<- eprint.keywords field->|
+----------+----------+---------------+--------------------------+
| eprintid | field    | word          |     keywords             |
+----------+----------+---------------+--------------------------+
|    29533 | keywords | ornithorhynch |     ornithorhynch        |
|    29534 | keywords | ornithorhynch |     Ornithorhynch        |
|    29535 | keywords | ornithorhynch |     ORNITHORHYNCH        |
+----------+----------+---------------+--------------------------+

The plural determination holds true for the humble Platypus as well :(

+----------+----------+-----------------+---------------------------+
| eprintid | field    | word            | keywords                  |
+----------+----------+-----------------+---------------------------+
|    29533 | keywords | ornithorhynchu  | ornithorhynchus, platypus |
|    29533 | keywords | platypu         | ornithorhynchus, platypus |
|    29534 | keywords | ornithorhynchu  | Ornithorhynchus, Platypus |
|    29534 | keywords | platypu         | Ornithorhynchus, Platypus |
|    29535 | keywords | ornithorhynchus | ORNITHORHYNCHUS, PLATYPUS |
|    29535 | keywords | platypus        | ORNITHORHYNCHUS, PLATYPUS |
+----------+----------+-----------------+---------------------------+


Cheers

Matt.




_____________________________________________________________

This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.



The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.



The University of Southern Queensland is a registered provider of education with the Australian Government.

(CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )


