[EP-tech] EPrints Search - Latest Items



Hi all,

I have been thinking about this and, as I was doing some work to provide 
a case-insensitive ID MetaField (useful for usernames and email 
addresses), I thought I would have a stab at writing an effective 
keywords MetaField [1]. It is still a bit of a work in progress, as 
shoehorning this into the existing EPrints framework in a way that lets 
repositories change their keywords field from longtext to this new 
MetaField has been a little tricky. I am not overly confident of its 
database query efficiency, but I am reasonably confident it is 
compatible across the currently supported database formats.

The field works on the premise that there is a separator for terms, 
allowing for multiple-word terms. The default separator is a comma (,) 
but this can be set to something else. I have made searching for terms 
case-insensitive. I think that the number of false positives as a 
result of this will be outweighed by the benefit of complex terms being 
matched even if the odd character is wrongly cased (e.g. ChadOx1 
nCoV-19 rather than ChAdOx1 nCoV-19).

As an example, let's take a record where the following keywords are 
set, using commas as term separators:

covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19

A search for "covid-19" or "Covid-19" would find a result but "covid" 
would not. Searches for "Coronavirus,covid-19", "Coronavirus, covid-19" 
and "Coronavirus,        covid-19" should all find results. Also, if 
you searched for "ChAdOx1 nCoV-19,Coronavirus" you would find a result 
but for "ChAdOx1,nCoV-19,Coronavirus" you would not.
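
The matching rules described above can be sketched in plain Perl. This 
is only an illustration of the intended behaviour, not the MetaField's 
actual implementation: the helper names split_terms and keywords_match 
are hypothetical, and it assumes that every term in a search must be 
present for a record to match, which is what the examples above imply.

```perl
use strict;
use warnings;

# Hypothetical helper: split a keywords string into normalised terms.
# Terms are trimmed and lowercased so matching is case-insensitive.
sub split_terms {
    my ( $value, $separator ) = @_;
    $separator = "," unless defined $separator;    # default separator
    my @terms;
    for my $term ( split /\Q$separator\E/, $value ) {
        $term =~ s/^\s+|\s+$//g;                   # trim surrounding whitespace
        next if $term eq "";
        push @terms, lc $term;
    }
    return @terms;
}

# Hypothetical helper: true if every term in the query is one of the
# stored terms (assumption: all query terms are required).
sub keywords_match {
    my ( $stored, $query, $separator ) = @_;
    my %have   = map { $_ => 1 } split_terms( $stored, $separator );
    my @wanted = split_terms( $query, $separator );
    return 0 unless @wanted;
    for my $term ( @wanted ) {
        return 0 unless $have{$term};
    }
    return 1;
}

my $stored = "covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19";
for my $query (
    "covid-19", "Covid-19", "covid",
    "Coronavirus,        covid-19",
    "ChAdOx1 nCoV-19,Coronavirus",
    "ChAdOx1,nCoV-19,Coronavirus",
) {
    my $result = keywords_match( $stored, $query ) ? "match" : "no match";
    print "$query => $result\n";
}
```

Whole-term comparison is why "covid" finds nothing here: it is not a 
term in its own right, only a fragment of "covid-19".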

Regards

David Newman

[1] https://github.com/eprints/eprints3.4/issues/61

On 30/04/2020 09:55, Christopher Gutteridge via Eprints-tech wrote:
>
> I don't recall if you can reindex individual fields.
>
> On 30/04/2020 09:51, Yuri via Eprints-tech wrote:
>>
>> Thanks for the pointer, maybe a check against a fixed vocabulary can 
>> be enough.
>>
>> This also mean reindex all the archive. Is it possible to reindex 
>> only title and keywords? Full text can be a problem to reindex if 
>> you've a lot of pdf, for example.
>>
>> On 30/04/20 10:29, Christopher Gutteridge via Eprints-tech wrote:
>>>
>>> EPrints makes some decisions on what to index. Those can be 
>>> overridden, if I recall the old magics from the dawn of time.
>>>
>>> https://github.com/eprints/eprints/blob/3.3/lib/defaultcfg/cfg.d/indexing.pl
>>>
>>> That, by default, uses EPrints' word split function 
>>> https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm#L39 
>>> which apparently uses the Perl regexp library to decide word breaks, 
>>> but you can write one that does what you want. 
>>> freetext_seperator_chars seems to be utterly ignored now.
>>>
>>> This is still obeyed
>>> $c->{indexing}->{freetext_min_word_size} = 3;
>>>
>>> Which caused some issues for people with the Chinese name "Wu".
>>>
>>> I would suggest considering keeping it by altering indexing.pl to 
>>> always index numbers even if they are one or two digits long. 
>>> Something like this (of course you'd then have to entirely reindex)
>>>
>>>
>>>         # First approximation is if this word is over or equal
>>>         # to the minimum size set in SiteInfo.
>>>         my $ok = $wordlen >= $c->{indexing}->{freetext_min_word_size};
>>>
>>>         if( $word =~ m/^\d+$/ ) {
>>>             $ok = 1;
>>>         }
>>>
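
To make the effect of that change concrete, here is a minimal 
standalone sketch. It does not use the EPrints tokenizer: split_words 
is a hypothetical stand-in that breaks on any character that is not a 
letter or digit, and index_words is likewise an illustrative name. The 
only parts taken from the message above are the minimum word size of 3 
and the rule that purely numeric words are always kept.

```perl
use strict;
use warnings;

my $min_word_size = 3;    # mirrors freetext_min_word_size above

# Hypothetical stand-in for the real word splitter: anything that is
# not a letter or digit is a word break, so "-" splits "covid-19".
sub split_words {
    my ( $text ) = @_;
    return grep { length } split /[^\p{L}\p{N}]+/, $text;
}

# Returns the lowercased words that would be kept for indexing.
# With $always_index_numbers set, purely numeric words bypass the
# minimum-size check.
sub index_words {
    my ( $text, $always_index_numbers ) = @_;
    my @kept;
    for my $word ( split_words( $text ) ) {
        my $ok = length( $word ) >= $min_word_size;
        $ok = 1 if $always_index_numbers && $word =~ m/^\d+$/;
        push @kept, lc $word if $ok;
    }
    return @kept;
}

my $text = "covid-19 SARS-CoV-2 Wu";
print "default:      ", join( " ", index_words( $text, 0 ) ), "\n";
print "with numbers: ", join( " ", index_words( $text, 1 ) ), "\n";
```

Under these assumptions the numeric exception keeps "19" and "2", but a 
short non-numeric word such as "Wu" is still dropped unless the minimum 
word size itself is lowered.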
>>> On 30/04/2020 08:27, Yuri via Eprints-tech wrote:
>>>>
>>>> Hi!
>>>>
>>>> I've found that the virus can also be referred to as "SARS COV-2", 
>>>> so maybe you can add this as well. But beware that EPrints search 
>>>> has a problem with "-": it splits the word on it.
>>>>
>>>> On 27/04/20 17:06, James Kerwin via Eprints-tech wrote:
>>>>> Hello All,
>>>>>
>>>>> I hope everyone is well in body and mind.
>>>>>
>>>>> I need some help with the EPrints search function. I have been 
>>>>> asked to add a box to the repository homepage that lists the 
>>>>> latest coronavirus-related deposits.
>>>>>
>>>>> I'm hoping to search via keywords for "coronavirus" and 
>>>>> "covid-19". I also want to search for either of these terms in 
>>>>> titles. To do this I'm currently butchering a copy of cgi/latest_tool.
>>>>>
>>>>> I can get the keywords part to work using:
>>>>>
>>>>>     $c->{latest_rona_modes} = {
>>>>>         default => { citation => "noauth" },
>>>>>         fplatest => {
>>>>>             citation => "popular", max => 5,
>>>>>             #citation => "result", max => 3,
>>>>>             filters => [
>>>>>                 #{ meta_fields => [ "full_text_status","full_text_status" ], value => ("none"||"public") }
>>>>>                 { meta_fields => [ "keywords" ], value => "covid-19" },
>>>>>             ],
>>>>>         },
>>>>>     };
>>>>>
>>>>> This also works with "title" as you would expect.
>>>>>
>>>>> What I really want is to do a search where the keywords can be 
>>>>> "covid-19" OR "coronavirus" as well as including some allowance 
>>>>> for adding an:
>>>>>
>>>>> "OR title LIKE '%covid-19%' OR title LIKE '%coronavirus%'" in 
>>>>> MySQL-speak.
>>>>>
>>>>> Am I able to do this using the EPrints::Search plugin? I've tried 
>>>>> reading the documentation and experimenting with it, but I'm not 
>>>>> getting very far.
>>>>>
>>>>> If it's not possible I can think of a number of bodges for it, but 
>>>>> decided it was best to attempt the proper way first.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>>>>> *** Archive: http://www.eprints.org/tech.php/
>>>>> *** EPrints community wiki: http://wiki.eprints.org/
>>> -- 
>>> Christopher Gutteridge <totl at soton.ac.uk>
>>> You should read our team blog at http://blog.soton.ac.uk/webteam/