
[EP-tech] Eprints-tech Digest, Vol 148, Issue 45



CAUTION: This e-mail originated outside the University of Southampton.
Martin, thanks for your feedback. This workflow includes a thesaurus (https://buildvoc.co.uk/bv/en/) and, for NLP, Annif (https://github.com/NatLibFi/Annif), which processes the abstract in EPrints and returns the keywords.

I need to have each keyword in an individual field, so that I can tell the indexer these fields are "phrases" containing white space.

Any ideas for a script to create 10 fields for uncontrolled keywords?
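
One possible shape for such a script, as a minimal sketch in Python rather than EPrints' own Perl: it splits a comma-separated keywords string into whole phrases and assigns each to a numbered slot. The field names keyword_1 to keyword_10 and the function name are hypothetical, chosen only for illustration; they are not existing EPrints fields.

```python
def split_keywords(raw, max_fields=10):
    """Split a comma-separated keyword string into whole phrases.

    Returns a dict mapping hypothetical field names (keyword_1, keyword_2, ...)
    to one phrase each; phrases beyond max_fields are dropped.
    """
    # Collapse internal runs of whitespace (including line breaks) and trim.
    phrases = [" ".join(part.split()) for part in raw.split(",")]
    # Discard empty entries caused by stray or trailing commas.
    phrases = [p for p in phrases if p]
    return {f"keyword_{i}": p for i, p in enumerate(phrases[:max_fields], start=1)}


fields = split_keywords("evacuation lift, part b - fire safety, fire risk assessment")
```

Each phrase keeps its internal spaces, so "part b - fire safety" stays one value instead of being broken into words.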

Best Regards,
Phil Stacey 07792661738
building regulations guidance for fire safety <https://eprints.buildvoc.co.uk/>

On 25 Jan 2021, at 11:49, eprints-tech-request at ecs.soton.ac.uk wrote:

Send Eprints-tech mailing list submissions to
   eprints-tech at ecs.soton.ac.uk

To subscribe or unsubscribe via the World Wide Web, visit
   http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
or, via email, send a message with subject or body 'help' to
   eprints-tech-request at ecs.soton.ac.uk

You can reach the person managing the list at
   eprints-tech-owner at ecs.soton.ac.uk

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Eprints-tech digest..."


Today's Topics:

  1. Antwort: Re:  Help indexing phrases (martin.braendle at uzh.ch)


----------------------------------------------------------------------

Message: 1
Date: Mon, 25 Jan 2021 12:49:00 +0100
From: <martin.braendle at uzh.ch>
Subject: [EP-tech] Antwort: Re:  Help indexing phrases
To: <eprints-tech at ecs.soton.ac.uk>, David R Newman
   <drn at ecs.soton.ac.uk>
Message-ID:
   <OF0D573B1B.0648AFE6-ONC1258668.0040E951-C1258668.0040E953 at lotus.uzh.ch>

Content-Type: text/plain; charset="utf-8"

Hi Phil,

In the end, the inverted indexes of standard search engines are based on single terms. This is a basic principle.

Xapian is pretty basic in this respect. More advanced search engines such as Elasticsearch offer field types such as "keyword" that allow a multi-term expression to be stored; in the end, however, the Lucene backend also stores single terms in its inverted indexes.
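
To make the single-term point concrete, here is a toy inverted index in Python. It only illustrates the principle and is not how Xapian or Lucene are implemented; the function names and sample records are made up.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each term produced by tokenize(text) to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def word_terms(text):
    # Word-level terms: "fire safety" becomes "fire" and "safety".
    return text.lower().replace(",", " ").split()


def phrase_terms(text):
    # Phrase-level terms: each comma-separated phrase is kept as one term.
    return [" ".join(p.lower().split()) for p in text.split(",") if p.strip()]


docs = {
    1: "fire risk assessment, building safety",
    2: "building regulations, fire safety",
}

by_word = build_index(docs, word_terms)
by_phrase = build_index(docs, phrase_terms)
```

With word-level terms, a lookup for "fire" finds both records. With phrase-level terms, only whole phrases such as "fire safety" are keys, and "fire" on its own matches nothing.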

Still, there is the difficulty of identifying a multi-term expression within a body of text. This is usually the domain of Natural Language Processing, and special tools and thesauri are needed.
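
A very simple form of this identification is dictionary lookup against a thesaurus: scan the text and, at each position, take the longest run of words that is a known term. The Python sketch below assumes a tiny hand-written term list and whitespace-separated input without punctuation; real tools do far more (stemming, ranking, disambiguation), and the function name is hypothetical.

```python
def find_terms(text, thesaurus, max_words=6):
    """Return thesaurus terms found in text, preferring the longest match at each position."""
    words = text.lower().split()
    known = {t.lower() for t in thesaurus}
    found = []
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest candidate first, then shrink it word by word.
        for n in range(min(max_words, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]) in known:
                match_len = n
                break
        if match_len:
            found.append(" ".join(words[i:i + match_len]))
            i += match_len  # Skip past the matched phrase.
        else:
            i += 1
    return found


thesaurus = ["fire safety", "fire risk assessment", "means of escape"]
hits = find_terms("a fire risk assessment should review the means of escape", thesaurus)
```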

Kind regards,

Martin


-----<eprints-tech-bounces at ecs.soton.ac.uk> wrote: -----
To: <eprints-tech at ecs.soton.ac.uk>, "Phil Stacey" <phil at buildvoc.co.uk>
From: "David R Newman via Eprints-tech"
Sent by:
Date: 25.01.2021 10:39
Subject: Re: [EP-tech] Help indexing phrases


Hi Phil,

Unfortunately, I don't think this is possible.  I think you would need to create a new multiple field of type id and use that.  You could probably write a script to map from the uncontrolled keywords field into this new multiple id field.  However, even with this new field I am not sure how well Xapian would index these as individual multi-word terms.  Advanced search for this field should work as you require.  In 3.4.2 I introduced the Idci MetaField, which is basically the same as the Id MetaField but matches case-insensitively. This is useful for matching things like email addresses and usernames, where case does not usually make a functional difference.
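
The difference between a whole-value, case-sensitive match and a whole-value, case-insensitive match can be sketched as follows. This is plain Python meant only to illustrate the behaviour described above; it is not EPrints code, and the function names are hypothetical.

```python
def match_exact(values, query):
    """Whole-value, case-sensitive match (the behaviour described for an id-style field)."""
    return [v for v in values if v == query]


def match_exact_ci(values, query):
    """Whole-value match ignoring case (the behaviour described for Idci)."""
    return [v for v in values if v.casefold() == query.casefold()]


keywords = ["Fire Safety", "fire risk assessment", "building safety"]

strict = match_exact(keywords, "fire safety")
relaxed = match_exact_ci(keywords, "fire safety")
partial = match_exact_ci(keywords, "fire")
```

In both cases the multi-word value is compared as a unit, so a single word such as "fire" does not match any of the stored phrases.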

I have been thinking about how best to implement a keywords field that is more effective across simple, advanced and faceted search, particularly for multi-word terms.  I have yet to settle on a solution, as I need to better understand how Xapian indexing works to see if it can be set up to allow EPrints to index multi-word terms effectively.

Regards

David Newman

On 25/01/2021 07:06, Phil Stacey via Eprints-tech wrote:
I am using the uncontrolled keywords field, which holds phrases separated by commas, and would like to index each whole phrase.

For example:
evacuation lift, part b - fire safety, b5 access and facilities for the fire
service, fire risk assessment, residual risk, building safety, b4 external
fire spread, means of escape, principal works, health & safety strategy
https://eprints.buildvoc.co.uk/id/eprint/865/

Question: how do I configure Xapian or indexing.pl to index the whole phrase instead of the individual terms (for example fire, safety, or building)?

Best Regards,
Phil Stacey
building regulations guidance for fire safety <https://eprints.buildvoc.co.uk/>


*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/


------------------------------

_______________________________________________
Eprints-tech mailing list
Eprints-tech at ecs.soton.ac.uk
http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech


End of Eprints-tech Digest, Vol 148, Issue 45
*********************************************