EPrints Technical Mailing List Archive

Message: #08176


Re: [EP-tech] EPrints Search - Latest Items


Hi all,

I have been thinking about this and, as I was already doing some work on a case-insensitive ID MetaField (useful for usernames and email addresses), I thought I would have a stab at writing an effective keywords MetaField [1].  It is still a work in progress: shoehorning this into the existing EPrints framework so that repositories can switch their keywords field from longtext to the new MetaField has been a little tricky.  I am not overly confident of its database query efficiency, but I am reasonably confident that it is compatible with the currently supported database formats.

The field works on the premise that terms are divided by a separator, which allows for multi-word terms.  The default separator is a comma (,) but this can be set to something else.  Searching for terms is case-insensitive.  I think the number of false positives this causes will be outweighed by the benefit of complex terms being matched even if the odd character is wrongly cased (e.g. ChadOx1 nCoV-19 rather than ChAdOx1 nCoV-19).

As an example, let's take a record where the following keywords are set, using commas as term separators:

covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19

A search for "covid-19" or "Covid-19" would find a result but "covid" would not.  Searches for "Coronavirus,covid-19", "Coronavirus, covid-19" and "Coronavirus,        covid-19" should all find results.  Also, a search for "ChAdOx1 nCoV-19,Coronavirus" would find a result but "ChAdOx1,nCoV-19,Coronavirus" would not.
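
To make the intended matching behaviour concrete, here is a rough standalone sketch in Perl. It is not the MetaField code from [1] and does not touch the database. The helper names split_terms and keywords_match are made up for illustration, and it assumes that a search only matches when every search term equals a whole stored term, which is how I read the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper, not the MetaField code from [1]: split a keywords
# string on the separator and normalise each term (trim, lower-case).
sub split_terms
{
    my( $value, $separator ) = @_;
    $separator = "," unless defined $separator;

    my @terms;
    foreach my $term ( split /\Q$separator\E/, $value )
    {
        $term =~ s/^\s+|\s+$//g;    # whitespace around a term is ignored
        next if $term eq "";
        push @terms, lc $term;      # matching is case-insensitive
    }
    return @terms;
}

# Assumption: a search matches only if every search term equals a whole
# stored term, so "covid" does not match "covid-19".
sub keywords_match
{
    my( $stored, $query, $separator ) = @_;

    my %have = map { $_ => 1 } split_terms( $stored, $separator );
    my @wanted = split_terms( $query, $separator );
    return 0 unless @wanted;

    foreach my $term ( @wanted )
    {
        return 0 unless $have{$term};
    }
    return 1;
}

my $keywords = "covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19";
foreach my $query ( "Covid-19", "covid", "Coronavirus,        covid-19", "ChAdOx1,nCoV-19,Coronavirus" )
{
    printf "%-30s %s\n", $query, keywords_match( $keywords, $query ) ? "match" : "no match";
}
```

Splitting on the separator before comparing is what lets a multi-word term such as "ChAdOx1 nCoV-19" be treated as a single unit, rather than as two words that could each match on their own.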

Regards

David Newman

[1] https://github.com/eprints/eprints3.4/issues/61

On 30/04/2020 09:55, Christopher Gutteridge via Eprints-tech wrote:

I don't recall if you can reindex individual fields.

On 30/04/2020 09:51, Yuri via Eprints-tech wrote:

Thanks for the pointer, maybe a check against a fixed vocabulary can be enough.

This also means reindexing the whole archive. Is it possible to reindex only title and keywords? Full text can be a problem to reindex if you have a lot of PDFs, for example.

On 30/04/20 10:29, Christopher Gutteridge via Eprints-tech wrote:

EPrints makes some decisions on what to index. Those can be overridden, if I recall the old magics from the dawn of time.

https://github.com/eprints/eprints/blob/3.3/lib/defaultcfg/cfg.d/indexing.pl

That, by default, uses the EPrints word split function https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm#L39 which apparently uses the Perl regexp library to decide word breaks, but you can write one that does what you want. freetext_seperator_chars seems to be utterly ignored now.

This is still obeyed:
$c->{indexing}->{freetext_min_word_size} = 3;

which caused some issues for people with the Chinese name "Wu".

I would suggest keeping it but altering indexing.pl to always index numbers, even if they are only one or two digits long. Something like this (of course you'd then have to reindex entirely):


        # First approximation is if this word is over or equal
        # to the minimum size set in SiteInfo.
        my $ok = $wordlen >= $c->{indexing}->{freetext_min_word_size};

        # Always index purely numeric words, however short.
        if( $word =~ m/^\d+$/ ) {
                $ok = 1;
        }
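
If you want to try the rule outside EPrints first, the following is a minimal standalone sketch of the same check. The function name word_is_indexable and its arguments are made up for this example; in a real repository the logic would sit inside indexing.pl and read the minimum size from the config.

```perl
use strict;
use warnings;

# Standalone sketch of the rule above; word_is_indexable and its
# arguments are hypothetical names, not part of EPrints.
sub word_is_indexable
{
    my( $word, $min_word_size ) = @_;

    # First approximation: the word is long enough to be worth indexing.
    my $ok = length( $word ) >= $min_word_size;

    # Proposed exception: purely numeric words are always indexed.
    $ok = 1 if $word =~ m/^\d+$/;

    return $ok ? 1 : 0;
}

foreach my $word ( "covid", "19", "2", "Wu" )
{
    printf "%-6s %s\n", $word, word_is_indexable( $word, 3 ) ? "indexed" : "skipped";
}
```

With a minimum size of 3 this keeps "19" and "2" but still skips "Wu", so the numeric exception on its own does not help with short names.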

On 30/04/2020 08:27, Yuri via Eprints-tech wrote:

Hi!

I've found that the virus can also be referred to as "SARS COV-2", so maybe you can add this too. But beware that EPrints search has a problem with "-": it splits the word on it.
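
A small Perl sketch of what I mean by the splitting. This is not the EPrints tokenizer; naive_split_words is a made-up stand-in that assumes words are simply broken wherever a run of non-word characters occurs.

```perl
use strict;
use warnings;

# Simplified stand-in for a word splitter, NOT the EPrints tokenizer:
# break the text wherever a run of non-word characters occurs.
sub naive_split_words
{
    my( $text ) = @_;
    return grep { length } split /\W+/, $text;
}

foreach my $term ( "covid-19", "SARS-CoV-2", "SARS COV-2" )
{
    printf "%-12s => %s\n", $term, join( " | ", naive_split_words( $term ) );
}
```

Under this assumption "covid-19" is never seen as one word, only as "covid" and "19", and both spellings of the SARS name break into the same three pieces apart from case.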

On 27/04/20 17:06, James Kerwin via Eprints-tech wrote:
Hello All,

I hope everyone is well in body and mind.

I need some help with the EPrints search function. I have been asked to add a box to the repository homepage that lists the latest coronavirus-related deposits.

I'm hoping to search via keywords for "coronavirus" and "covid-19". I also want to search for either of these terms in titles. To do this I'm currently butchering a copy of cgi/latest_tool.

I can get the keywords part to work using:

$c->{latest_rona_modes} = {
        default => { citation => "noauth" },
        fplatest => {
                citation => "popular", max => 5,
                #citation => "result", max => 3,
                filters => [
                        #{ meta_fields => [ "full_text_status","full_text_status" ], value => ("none"||"public") }
                        { meta_fields => [ "keywords" ], value => "covid-19" },
                ],
        },
};

This also works with "title" as you would expect.

What I really want is a search where the keywords can be "covid-19" OR "coronavirus", with some allowance for adding:

 "OR title LIKE '%covid-19%' OR title LIKE '%coronavirus%'" in MySQL-speak.

Am I able to do this using the EPrints::Search plugin? I've tried reading the documentation and experimenting with it, but I'm not getting very far.

If it's not possible I can think of a number of bodges for it, but decided it was best to attempt the proper way first.

Thanks,
James

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/

-- 
Christopher Gutteridge <totl@soton.ac.uk> 
You should read our team blog at http://blog.soton.ac.uk/webteam/
