EPrints Technical Mailing List Archive

Message: #09573

Re: [EP-tech] Indexing - cleanup indexed terms after mass deletions

Hi Matt,

On Tue, 30 Jan 2024 at 09:31, Matthew Brady <Matthew.Brady@unisq.edu.au> wrote:
> Hi All,
> Our original repo houses traditional outputs (articles, conference papers, etc.) as well as theses…
> We have split the Theses into a dedicated repo, cloning the original system (metadata and files), and then removed the non-theses (search->batch edit->remove all records).
> I have noticed that there are entries in the various database index tables, referring to eprints that are no longer in the system…
> I have run epadmin reindex over ‘<repo> eprint’ and ‘<repo> document’, but the indexed values persist…
> e.g. eprint__index contains a fieldword = ‘title:elephant’ with ids = ‘:12345:’  but there is no eprint 12345 in the system any longer.
> I thought the permanent removal of the non-theses items would have cleaned up the index tables as the removal occurred?
> Any thoughts appreciated.
> Cheers,
> Matt

In this particular case, is 'title:elephant' associated with any
of your theses, or _only_ with deleted records? If it's the latter,
then the row is orphaned: no surviving eprint refers to it. Any
reindexing task built around "foreach(eprint)" rather than
"foreach(tablerow)" will never visit that row, so it won't know to
clean it up.
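
To illustrate the difference, here is a minimal Python sketch (not EPrints
code). It assumes the ids column holds a colon-delimited list such as
':12345:', as in Matt's example, and that the set of surviving eprint ids is
already known. The function names, the row layout, and the sample data are
all made up for illustration. Because it walks the index rows themselves, it
can see rows that no surviving eprint points at.

```python
def parse_ids(ids_field):
    """Split a colon-delimited ids string such as ':12:345:' into a set of ints."""
    return {int(part) for part in ids_field.split(":") if part}


def find_stale_rows(index_rows, live_ids):
    """Walk every index row ("foreach(tablerow)") and report dead references.

    index_rows: iterable of (fieldword, ids_field) pairs (assumed layout).
    live_ids:   set of eprint ids that still exist.
    Returns a list of (fieldword, stale_ids, fully_orphaned) tuples.
    """
    report = []
    for fieldword, ids_field in index_rows:
        ids = parse_ids(ids_field)
        stale = ids - live_ids
        if stale:
            # fully_orphaned is True when no surviving eprint uses the row
            report.append((fieldword, sorted(stale), stale == ids))
    return report


# Hypothetical data: eprint 12345 was removed; 20001 and 20002 are theses.
rows = [
    ("title:elephant", ":12345:"),
    ("title:thesis", ":12345:20001:20002:"),
    ("title:ecology", ":20001:"),
]
live = {20001, 20002}

for fieldword, stale, orphaned in find_stale_rows(rows, live):
    kind = "orphaned row" if orphaned else "row with stale ids"
    print(f"{fieldword}: {kind}, stale ids {stale}")
```

A loop driven by the surviving eprints would only ever touch the terms those
eprints contain, so a row like 'title:elephant' in the sample data would
never come up.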

We should probably have a look at the remove/delete routines and see
how deep they go into cleaning up index tables, filesystem
directories, view pages, etc. Off the top of my head I don't know at
all, I'm afraid. I assume "not very deep."

For what it's worth, in moments of questionable judgement I have
purged our repository's various _index, _rindex, and _orderval tables
and triggered the appropriate reindexing/reordering tasks manually. It
doesn't seem to have caused any problems after the fact.
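
As a rough sketch of the first step, this Python snippet only picks out which
tables would be candidates for purging. It is a dry run that touches no
database. The table names in the sample list and the matching rules (names
ending in "__index" or "__rindex", or containing "__orderval") are
assumptions for illustration, so check them against your own schema, and take
a backup before emptying anything.

```python
def derived_tables(table_names):
    """Return the tables that look like rebuildable index/ordering tables.

    The matching rules are assumptions for illustration, not taken from
    the EPrints source.
    """
    selected = []
    for name in table_names:
        if name.endswith(("__index", "__rindex")) or "__orderval" in name:
            selected.append(name)
    return sorted(selected)


# Hypothetical table list; a real repository will have many more tables.
tables = [
    "eprint",
    "eprint__index",
    "eprint__rindex",
    "eprint__ordervalues_en",
    "document",
    "document__index",
    "user",
]

for name in derived_tables(tables):
    print(f"candidate for purge and rebuild: {name}")
```

The main dataset tables (eprint, document, user in the sample) are left
alone; only tables that the reindexing/reordering tasks can regenerate are
listed.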

  Matthew Kerwin