[EP-tech] EPrints Search - Latest Items
- Subject: [EP-tech] EPrints Search - Latest Items
- From: drn at ecs.soton.ac.uk (David R Newman)
- Date: Sun, 24 May 2020 15:11:31 +0100
Hi all,
I have been thinking about this and, as I was already doing some work to
provide a case-insensitive ID MetaField (useful for usernames and email
addresses), I thought I would have a stab at writing an effective
keywords MetaField [1]. It is still a work in progress, because
shoehorning this into the existing EPrints framework in such a way that
repositories could change their keywords field from longtext to this new
MetaField has been a little tricky. I am not overly confident of its
database query efficiency, but I am reasonably confident it is compatible
across the currently supported database formats.
The field works on the premise that terms are delimited by a separator,
which allows for multi-word terms. The default separator is a comma (,)
but this can be set to something else. Searching for terms is
case-insensitive. I think that the number of false positives this causes
will be outweighed by the benefit of complex terms being matched even if
the odd character is wrongly cased (e.g. ChadOx1 nCoV-19 rather than
ChAdOx1 nCoV-19).
As an example, take a record where the following keywords are set, using
commas as term separators:
covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19
A search for "covid-19" or "Covid-19" would find a result but "covid"
would not. Searches for "Coronavirus,covid-19", "Coronavirus, covid-19"
and "Coronavirus,        covid-19" should all find results. Also, a
search for "ChAdOx1 nCoV-19,Coronavirus" would find a result but
"ChAdOx1,nCoV-19,Coronavirus" would not.
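To make the matching rules above concrete, here is a minimal standalone
Perl sketch of the semantics. It is only an illustration, not the code
from [1]: split_terms and keywords_match are hypothetical helper names,
and it assumes a search matches only when every term in the query is
present as a whole term in the stored keywords.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): split a keywords string into
# trimmed, lower-cased terms using a configurable separator.
sub split_terms {
    my( $value, $separator ) = @_;
    $separator = "," unless defined $separator;

    my @terms;
    foreach my $term ( split /\Q$separator\E/, $value ) {
        $term =~ s/^\s+|\s+$//g;    # ignore whitespace around each term
        next if $term eq "";
        push @terms, lc $term;      # compare case-insensitively
    }
    return @terms;
}

# Hypothetical helper: true if every term in the query appears as a whole
# term in the stored keywords. Assumption: multiple query terms are ANDed.
sub keywords_match {
    my( $stored, $query, $separator ) = @_;

    my %have = map { ( $_ => 1 ) } split_terms( $stored, $separator );
    my @wanted = split_terms( $query, $separator );
    return 0 unless @wanted;

    foreach my $term ( @wanted ) {
        return 0 unless $have{$term};
    }
    return 1;
}

my $stored = "covid-19, Coronavirus, SARS-CoV-2, ChAdOx1 nCoV-19";
foreach my $query (
    "Covid-19",
    "covid",
    "Coronavirus,        covid-19",
    "ChAdOx1 nCoV-19,Coronavirus",
    "ChAdOx1,nCoV-19,Coronavirus",
) {
    my $result = keywords_match( $stored, $query ) ? "match" : "no match";
    printf "%-30s %s\n", $query, $result;
}
```

Because terms are compared whole, "covid" and "ChAdOx1" on their own do
not match here, which is the behaviour described above.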
Regards
David Newman
[1] https://github.com/eprints/eprints3.4/issues/61
On 30/04/2020 09:55, Christopher Gutteridge via Eprints-tech wrote:
>
> I don't recall if you can reindex individual fields.
>
> On 30/04/2020 09:51, Yuri via Eprints-tech wrote:
>>
>> Thanks for the pointer, maybe a check against a fixed vocabulary would
>> be enough.
>>
>> This also means reindexing the whole archive. Is it possible to reindex
>> only title and keywords? Full text can be a problem to reindex if
>> you have a lot of PDFs, for example.
>>
>> Il 30/04/20 10:29, Christopher Gutteridge via Eprints-tech ha scritto:
>>>
>>> EPrints makes some decisions on what to index. Those can be
>>> overridden, if I recall the old magics from the dawn of time.
>>>
>>> https://github.com/eprints/eprints/blob/3.3/lib/defaultcfg/cfg.d/indexing.pl
>>>
>>> That, by default, uses the EPrints word split function
>>> https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm#L39
>>> which apparently uses the perl regexp library to decide word breaks,
>>> but you can write one that does what you want.
>>> freetext_seperator_chars seems utterly ignored now.
>>>
>>> This is still obeyed
>>> $c->{indexing}->{freetext_min_word_size} = 3;
>>>
>>> Which caused some issues for people with the Chinese name "Wu".
>>>
>>> I would suggest considering keeping it by altering indexing.pl to
>>> always index numbers even if they are one or two digits long.
>>> Something like this (of course you'd then have to entirely reindex)
>>>
>>>
>>>     # First approximation is if this word is over or equal
>>>     # to the minimum size set in SiteInfo.
>>>     my $ok = $wordlen >= $c->{indexing}->{freetext_min_word_size};
>>>
>>>     if( $word =~ m/^\d+$/ ) {
>>>         $ok = 1;
>>>     }
>>>
>>> On 30/04/2020 08:27, Yuri via Eprints-tech wrote:
>>>>
>>>> Hi!
>>>>
>>>> I've found that the virus can also be referred to as "SARS COV-2", so
>>>> maybe you can add this as well. But beware that EPrints search has a
>>>> problem with "-": it splits words on it.
>>>>
>>>> Il 27/04/20 17:06, James Kerwin via Eprints-tech ha scritto:
>>>>> Hello All,
>>>>>
>>>>> I hope everyone is well in body and mind.
>>>>>
>>>>> I need some help with the EPrints search function. I have been
>>>>> asked to add a box to the repository homepage that lists the
>>>>> latest coronavirus-related deposits.
>>>>>
>>>>> I'm hoping to search via keywords for "coronavirus" and
>>>>> "covid-19". I also want to search for either of these terms in
>>>>> titles. To do this I'm currently butchering a copy of cgi/latest_tool.
>>>>>
>>>>> I can get the keywords part to work using:
>>>>>
>>>>> $c->{latest_rona_modes} = {
>>>>>     default => { citation => "noauth" },
>>>>>     fplatest => {
>>>>>         citation => "popular", max => 5,
>>>>>         #citation => "result", max => 3,
>>>>>         filters => [
>>>>>             #{ meta_fields => [ "full_text_status","full_text_status" ], value => ("none"||"public") }
>>>>>             { meta_fields => [ "keywords" ], value => "covid-19" }
>>>>>
>>>>> This also works with "title" as you would expect.
>>>>>
>>>>> What I really want is to do a search where the keywords can be
>>>>> "covid-19" OR "coronavirus", with some allowance for adding:
>>>>>
>>>>> "OR title LIKE '%covid-19%' OR title LIKE '%coronavirus%'" in
>>>>> MySQL-speak.
>>>>>
>>>>> Am I able to do this using the EPrints::Search plugin? I've tried
>>>>> reading the documentation and experimenting with it, but I'm not
>>>>> getting very far.
>>>>>
>>>>> If it's not possible I can think of a number of bodges for it, but
>>>>> decided it was best to attempt the proper way first.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>>>>> *** Archive: http://www.eprints.org/tech.php/
>>>>> *** EPrints community wiki: http://wiki.eprints.org/
>>>>
>>> --
>>> Christopher Gutteridge <totl at soton.ac.uk>
>>> You should read our team blog at http://blog.soton.ac.uk/webteam/
>>>
>>
> --
> Christopher Gutteridge <totl at soton.ac.uk>
> You should read our team blog at http://blog.soton.ac.uk/webteam/
>