EPrints Technical Mailing List Archive

Message: #06556



Re: [EP-tech] Plural words in search results


The 'extract_words' function does this.

It treats anything in all caps as an acronym, and doesn't strip the trailing 's' from it.

 

I'm not sure if there's a sensible way round this - unless you want to somehow treat all keywords as non-acronyms (and lowercase them all before indexing)?
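To make that rule and the lowercase workaround concrete, here is a minimal Python sketch. It only models the behaviour described in this thread and is not the actual Perl in cfg.d/indexer.pl; the names index_term and index_term_no_acronyms are made up for the example, and it assumes the only plural handling is stripping a single trailing 's'.

```python
def index_term(word: str) -> str:
    """Hypothetical model of the indexing rule described in this thread."""
    # Assumption: an all-caps word counts as an acronym and is kept whole
    # (it is only lowercased for storage in the index).
    if word.isupper():
        return word.lower()
    term = word.lower()
    # Otherwise a single trailing 's' is stripped as a crude plural rule.
    if term.endswith("s"):
        term = term[:-1]
    return term


def index_term_no_acronyms(word: str) -> str:
    """Workaround sketch: lowercase first so nothing looks like an acronym."""
    return index_term(word.lower())


for keyword in ("platypus", "Platypus", "PLATYPUS"):
    print(keyword, index_term(keyword), index_term_no_acronyms(keyword))
```

With the workaround all three spellings end up as the same index term, at the cost of real acronyms ending in 'S' losing that letter too.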

 

Cheers,

John

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Matthew Brady
Sent: 02 June 2017 00:11
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Plural words in search results

 

Hi John,

 

Thanks for the feedback… I think I was involved in the discussions years ago about the simple search etc…

 

This particular issue appears to be more subtle.

The indexer removes the trailing ‘s’ from the keywords (unless the keyword is all caps).

 

Keyword     Indexed term
platypus    platypu
Platypus    platypu
PLATYPUS    platypus

 

It also removes the trailing ‘s’ from the search term. But if the keywords were entered in all caps, the ‘s’ is not removed at indexing time, so the search fails for that item (because the ‘s’ has been removed from the search term).

Search expression       ‘platypus’
Search performed using  ‘platypu’

 

Keyword     Indexed term      Match
platypus    platypu           Yes
Platypus    platypu           Yes
PLATYPUS    platypus          No
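The mismatch can be reproduced with a small Python sketch. This is only a model of what I am seeing, not the EPrints code: strip_plural, index_term and search_term are hypothetical names, and the thread does not say how an all-caps search term is handled, so the search side here simply lowercases and strips.

```python
def strip_plural(term: str) -> str:
    # Crude plural rule: drop one trailing 's' if present.
    return term[:-1] if term.endswith("s") else term


def index_term(keyword: str) -> str:
    # Index side: all-caps keywords are kept whole, others are stripped.
    lowered = keyword.lower()
    return lowered if keyword.isupper() else strip_plural(lowered)


def search_term(query: str) -> str:
    # Search side (assumption): the 's' is stripped regardless of case.
    return strip_plural(query.lower())


query = "platypus"
keywords = ["platypus", "Platypus", "PLATYPUS"]

# A record matches only if its indexed term equals the processed query.
matches = {k: index_term(k) == search_term(query) for k in keywords}

for keyword, matched in matches.items():
    print(f"{keyword:<10} {index_term(keyword):<10} {'Yes' if matched else 'No'}")
```

The two sides only agree when both apply the same normalisation, which is why the all-caps record drops out.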

 

 

I will dig into it, and let you know if there is more patching required ;)

 

Cheers

 

Matt.

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of John Salter
Sent: Thursday, 1 June 2017 7:16 PM
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Plural words in search results

 

Hi Matt,

I've looked into a similar issue in the past - and I think it was discussed on the tech list a few years ago.

I had added a fix (we're 3.3.10 too) - which recently was discovered to break things in a more subtle way.

 

If I remember the full story, it goes something like this:

The 'simple' search field is broken in vanilla EPrints 3.3.10 - as it doesn't strip out short-words.

The initial fix for this was to run the search terms through the $c->{extract_words} function (in cfg.d/indexer.pl).

This seemed to resolve the issue (we'd been running it like this for a few years), but we discovered that for a search field looking at multiple metafield types (e.g. a text field and a name field), if the search term ended in -ss it wouldn't find anything.

 

My current fix is: https://gist.github.com/jesusbagpuss/e096430c825d34a2ef1de671e8a7dfda

Both are 'patch' files (overwrite methods in the core EPrints modules - we try to keep these things separated - but you could just take the methods and edit the files they're patching directly).

 

There are two files - one resolves an issue with apostrophes in names (which may or may not affect you).

 

The issue you report is slightly different from the one we found - but I think the cause might be very similar - the stripping of a trailing 's' is applied during indexing, but the same is not applied when searching.

 

Hope that gets you somewhere - some of this stuff is fairly recent in my mind (fixing the fix took a bit of tracing through the modules) - there may be more useful stuff I have in my head!

 

Cheers,

John

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Matthew Brady
Sent: 01 June 2017 06:09
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Plural words in search results

 

Hi All,

 

One of our users came across a problem, when performing some keyword searches…  and assumed it was a case problem, since the all uppercase words in their testing weren’t returning in the result set.

 

After testing, I have a preliminary diagnosis (we are running 3.3.10, if it makes a difference).

 

It appears the index process is removing the ‘s’ off the end of the word (unless the word is all caps).

When performing a search, the system removes the ‘s’ from the search term, and performs the search… in our case this returns 2 of 3 test records.

 

When I took the last two letters off each eprint’s keywords, and then performed a search, it returned all three records in the results.

 

+----------+----------+---------------+--------------------------+
|<-   details from eprint__rindex   ->|<- eprint.keywords field->|
+----------+----------+---------------+--------------------------+
| eprintid | field    | word          |     keywords             |
+----------+----------+---------------+--------------------------+
|    29533 | keywords | ornithorhynch |     ornithorhynch        |
|    29534 | keywords | ornithorhynch |     Ornithorhynch        |
|    29535 | keywords | ornithorhynch |     ORNITHORHYNCH        |
+----------+----------+---------------+--------------------------+

 

The plural determination holds true for the humble Platypus as well :(

 

+----------+----------+-----------------+---------------------------+
| eprintid | field    | word            | keywords                  |
+----------+----------+-----------------+---------------------------+
|    29533 | keywords | ornithorhynchu  | ornithorhynchus, platypus |
|    29533 | keywords | platypu         | ornithorhynchus, platypus |
|    29534 | keywords | ornithorhynchu  | Ornithorhynchus, Platypus |
|    29534 | keywords | platypu         | Ornithorhynchus, Platypus |
|    29535 | keywords | ornithorhynchus | ORNITHORHYNCHUS, PLATYPUS |
|    29535 | keywords | platypus        | ORNITHORHYNCHUS, PLATYPUS |
+----------+----------+-----------------+---------------------------+

 

 

Cheers

 

Matt.

 

 
_____________________________________________________________
This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.
 
The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.
 
The University of Southern Queensland is a registered provider of education with the Australian Government.
(CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
 