EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09578



Re: [EP-tech] Indexing - cleanup indexed terms after mass deletions


Hi Matt,

Batch edit is sometimes a law unto itself.  I think the following script will allow you to delete the index entries for a data object in any dataset:

#!/usr/bin/perl -w

############################################################################
#
# Remove Data Object from Index and Ordervalues
#
# Usage: ./remove_index <ARCHIVE_ID> <DATASET_ID> <DATAOBJ_ID>
#
############################################################################

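use strict;
use FindBin;
use lib "$FindBin::Bin/../../../perl_lib";

use EPrints;

# All three arguments are required (see the usage line above).
my( $repoid, $datasetid, $dataobjid ) = @ARGV;
die "Usage: $0 <ARCHIVE_ID> <DATASET_ID> <DATAOBJ_ID>\n" unless defined $dataobjid;

# Mode 1 opens an offline (command-line) session on the given archive.
my $session = EPrints::Session->new( 1 , $repoid , 1 );
die "Could not open archive: $repoid\n" unless defined $session;

my $dataset = $session->dataset( $datasetid );
if( !defined $dataset )
{
    $session->terminate;
    die "Unknown dataset: $datasetid\n";
}

# Remove the data object's index entries, then its ordervalues.
EPrints::Index::remove_all( $session, $dataset, $dataobjid );
EPrints::Index::delete_ordervalues( $session, $dataset, $dataobjid );

$session->terminate;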


This script assumes it has been added to the bin directory of your archive; if it is elsewhere, you may need to update the perl_lib path that is built from FindBin.  Currently the script can only remove the index entries for one data object at a time, but it could easily be modified to iterate through a list.  EPrints::Index::remove_all removes all of the data object's indexed fields from the DATASET__rindex and DATASET__index_grep tables.  EPrints::Index::delete_ordervalues removes the data object's records from the DATASET__ordervalues_LANG tables.  This script will not touch the DATASET__index table, but more recently (at least since 3.4, if not earlier) this table has not been used.  I would advise you to stop the EPrints indexer before running this script, although in theory, if you have InnoDB tables, the database should be able to cope with multiple processes modifying the index tables.
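
As a rough sketch of the "iterate through a list" modification mentioned above, the fragment below parses a list of IDs and runs a callback once per ID.  The helper names parse_ids and for_each_id are hypothetical (they are not part of EPrints), and the demo callback only prints; in the real script the callback would make the two EPrints::Index calls shown in the comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn lines of text into a de-duplicated list of
# numeric data object IDs, skipping blank lines and '#' comments.
sub parse_ids
{
    my( @lines ) = @_;
    my( %seen, @ids );
    foreach my $line ( @lines )
    {
        $line =~ s/#.*$//;         # strip comments
        $line =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
        next if $line eq "";
        if( $line !~ /^\d+$/ )
        {
            warn "Skipping non-numeric ID: $line\n";
            next;
        }
        push @ids, $line unless $seen{$line}++;
    }
    return @ids;
}

# Hypothetical helper: call $callback once per ID; returns the number processed.
sub for_each_id
{
    my( $callback, @ids ) = @_;
    my $count = 0;
    foreach my $id ( @ids )
    {
        $callback->( $id );
        $count++;
    }
    return $count;
}

# Demo with a stand-in callback.  In the real script the callback would be:
#   sub { my( $id ) = @_;
#         EPrints::Index::remove_all( $session, $dataset, $id );
#         EPrints::Index::delete_ordervalues( $session, $dataset, $id ); }
my @ids = parse_ids( "12345", " 12346 ", "", "# a comment", "12345", "abc" );
my $n = for_each_id( sub { print "Would remove index entries for $_[0]\n" }, @ids );
print "Processed $n data object(s)\n";
```

Keeping the ID parsing separate from the EPrints calls means the list handling can be checked on its own before anything touches the index tables.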

Regards

David Newman


On 30/01/2024 3:46 am, Matthew Kerwin wrote:
CAUTION: This e-mail originated outside the University of Southampton.

Hi Matt,

On Tue, 30 Jan 2024 at 09:31, Matthew Brady <Matthew.Brady@unisq.edu.au> wrote:
Hi All,

Our original repo houses traditional outputs (articles, conference papers, etc.) as well as theses…
We have split the theses into a dedicated repo by cloning the original system (metadata and files) and then removing the non-theses (search->batch edit->remove all records).

I have noticed that there are entries in the various database index tables, referring to eprints that are no longer in the system…
I have run epadmin reindex over ‘<repo> eprint’ and ‘<repo> document’, but the indexed values persist…

e.g. eprint__index contains a fieldword = ‘title:elephant’ with ids = ‘:12345:’  but there is no eprint 12345 in the system any longer.

I thought the permanent removal of the non-theses items would have cleaned up the index tables as the removal occurred?

Any thoughts appreciated.

Cheers,
Matt

In this particular case, is the 'title:elephant' associated with any
of your theses, or _only_ with deleted records? Because if it's the
latter, then the row is orphaned – it has no inward referential links
– so any reindexing task that is built around "foreach(eprint)" rather
than "foreach(tablerow)" won't even see the row in question, so won't
know to clean it up.
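
To illustrate the "foreach(tablerow)" idea, here is a minimal sketch that walks index rows and reports IDs with no live record.  It assumes that an ids value is a colon-delimited list in the style of the ':12345:' example quoted above; the multi-ID layout, the sample rows, the set of live IDs, and the helper name orphaned_ids are all made up for illustration, and nothing here queries a real database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given an ids value such as ':12:34:' (assumed format)
# and a hash ref of live IDs, return the IDs that no longer exist.
sub orphaned_ids
{
    my( $ids_value, $live ) = @_;
    my @ids = grep { $_ ne "" } split /:/, $ids_value;
    return grep { !$live->{$_} } @ids;
}

# Made-up sample data: two live eprints and two index rows.
my %live = map { $_ => 1 } ( 200, 201 );
my %rows = (
    'title:elephant' => ':12345:',
    'title:thesis'   => ':200:12345:201:',
);

# Iterate over table rows rather than over eprints, so rows that are
# referenced only by deleted records are still visited.
foreach my $fieldword ( sort keys %rows )
{
    my @orphans = orphaned_ids( $rows{$fieldword}, \%live );
    print "$fieldword: orphaned ids = @orphans\n" if @orphans;
}
```

A row whose every ID is orphaned (like the 'title:elephant' row here) is the case a per-eprint reindex would never reach.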

We should probably have a look at the remove/delete routines and see
how deep they go into cleaning up index tables, filesystem
directories, view pages, etc. Off the top of my head I don't know at
all, I'm afraid. I assume "not very deep."

For what it's worth, in moments of questionable judgement I have
purged our repository's various _index, _rindex, and _orderval tables
and triggered the appropriate reindexing/reordering tasks manually. It
doesn't seem to have caused any problems after the fact.

Cheers
--
  Matthew Kerwin

*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/