EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09573



Re: [EP-tech] Indexing - cleanup indexed terms after mass deletions



Hi Matt,

On Tue, 30 Jan 2024 at 09:31, Matthew Brady <Matthew.Brady@unisq.edu.au> wrote:
>
> Hi All,
>
> Our original repo houses traditional outputs (articles, conference papers, etc.) as well as theses…
> We have split the Theses into a dedicated repo, cloning the original system (metadata and files), and then removed the non-theses (search->batch edit->remove all records).
>
> I have noticed that there are entries in the various database index tables, referring to eprints that are no longer in the system…
> I have run epadmin reindex over ‘<repo> eprint’ and ‘<repo> document’, but the indexed values persist…
>
> e.g. eprint__index contains a fieldword = ‘title:elephant’ with ids = ‘:12345:’  but there is no eprint 12345 in the system any longer.
>
> I thought the permanent removal of the non-theses items would have cleaned up the index tables as the process occurred?
>
> Any thoughts appreciated.
>
> Cheers,
> Matt
>

In this particular case, is 'title:elephant' associated with any of
your theses, or _only_ with deleted records? If it's the latter, then
the row is orphaned: it has no inward referential links. Any
reindexing task that is built around "foreach(eprint)" rather than
"foreach(tablerow)" won't even see the row in question, so it won't
know to clean it up.
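
To illustrate the difference, here is a rough Python sketch against a toy
SQLite database. It is not EPrints code. The table and column names
(eprint, eprint__index, fieldword, ids) follow the example quoted above, but
the simplified schema, the colon-separated multi-id format, eprint 777, and
all function names are assumptions made up for the illustration.

```python
import sqlite3


def build_demo_db():
    """Create a toy database: one surviving eprint plus an index with stale ids."""
    db = sqlite3.connect(":memory:")
    # Simplified, assumed schema; the real EPrints tables have more columns.
    db.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE eprint__index (fieldword TEXT PRIMARY KEY, ids TEXT)")
    db.execute("INSERT INTO eprint VALUES (777, 'giraffe thesis')")
    db.executemany(
        "INSERT INTO eprint__index VALUES (?, ?)",
        [
            ("title:giraffe", ":777:"),
            ("title:thesis", ":777:12345:"),   # shared with a deleted eprint
            ("title:elephant", ":12345:"),     # only the deleted eprint: orphaned
        ],
    )
    return db


def parse_ids(ids):
    """Turn ':777:12345:' into {777, 12345} (assumed storage format)."""
    return {int(part) for part in ids.split(":") if part}


def rows_seen_per_eprint(db):
    """The "foreach(eprint)" view: index rows reachable from live records."""
    seen = set()
    for _eprintid, title in db.execute("SELECT eprintid, title FROM eprint"):
        for word in title.lower().split():
            seen.add("title:" + word)
    return seen


def find_stale_rows(db):
    """The "foreach(tablerow)" view: walk the index and report dead ids."""
    live = {row[0] for row in db.execute("SELECT eprintid FROM eprint")}
    stale = {}
    for fieldword, ids in db.execute("SELECT fieldword, ids FROM eprint__index"):
        dead = parse_ids(ids) - live
        if dead:
            stale[fieldword] = sorted(dead)
    return stale


db = build_demo_db()
print(sorted(rows_seen_per_eprint(db)))  # 'title:elephant' never appears here
print(find_stale_rows(db))               # the per-row walk does report it
```

The per-eprint walk only ever touches rows for words in surviving titles, so
the orphaned row is invisible to it; only the per-row walk can find it.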

We should probably have a look at the remove/delete routines and see
how deep they go into cleaning up index tables, filesystem
directories, view pages, etc. Off the top of my head I don't know at
all, I'm afraid. I assume "not very deep."

For what it's worth, in moments of questionable judgement I have
purged our repository's various _index, _rindex, and _orderval tables
and triggered the appropriate reindexing/reordering tasks manually. It
doesn't seem to have caused any problems after the fact.
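
As a rough picture of why purge-then-rebuild gets rid of orphans, here is a
toy Python/SQLite sketch. It is not EPrints code and should not be run
against a real repository database: the two-column schema, the ids format,
eprint 777, and the rebuild_index function are all assumptions for
illustration. In a real repository the rebuild step would be EPrints' own
reindexing tasks, not hand-written SQL.

```python
import sqlite3


def rebuild_index(db):
    """Empty the toy index table, then repopulate it from surviving eprints only."""
    db.execute("DELETE FROM eprint__index")
    postings = {}
    rows = db.execute("SELECT eprintid, title FROM eprint ORDER BY eprintid").fetchall()
    for eprintid, title in rows:
        for word in title.lower().split():
            ids = postings.setdefault("title:" + word, [])
            if eprintid not in ids:  # avoid duplicates for repeated words
                ids.append(eprintid)
    db.executemany(
        "INSERT INTO eprint__index VALUES (?, ?)",
        [(fw, ":" + ":".join(str(i) for i in ids) + ":") for fw, ids in postings.items()],
    )
    db.commit()


def make_demo_db():
    """Toy database with one live eprint and an index row left by a deleted one."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE eprint__index (fieldword TEXT PRIMARY KEY, ids TEXT)")
    db.execute("INSERT INTO eprint VALUES (777, 'giraffe thesis')")
    db.execute("INSERT INTO eprint__index VALUES ('title:elephant', ':12345:')")
    return db


demo = make_demo_db()
rebuild_index(demo)
print(demo.execute("SELECT fieldword, ids FROM eprint__index ORDER BY fieldword").fetchall())
```

Because the rebuild is driven entirely by records that still exist, nothing
can put the deleted eprint's rows back once the table has been emptied.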

Cheers
--
  Matthew Kerwin
  https://matthew.kerwin.net.au/