[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[EP-tech] Extending EPrints with pull-based import plug ins


This is a bit of an in-depth query, so I'm not sure whether I'm directing it to the right place. If I'm not, apologies in advance.

I'm currently working on an extension to EPrints that will implement pull-based imports. The basic idea is that rather than have researchers at our institution have to manually search for documents they have uploaded to external repositories such as Web of Science or SciVerse, this should be performed automatically by the repository, alerting them when new documents have been found and allowing them to select which to import.

I would implement this with a few separate components:

1. New kind of plug in for pull-based imports
    The basic interface will take an EPrints user, find associated documents on an external source, convert them into EPrints, and return them to the calling script.* It should be easy to extend the system with new external sources this way.
2. Script / daemon for performing the imports.
    I'm initially going to implement this as a cron job that runs nightly, iterating through every user on the repository, and using each import component to build a list of prints. This will be checked against the local cache of imported prints to eliminate duplicates and also a list of prints that users on the system have 'deleted' - i.e. said they do not want to import or see any more.
3. Screen plug in for viewing outstanding imports. 
   This would work like the work area, displaying a paginated list of EPrints available for import into the work area. Each one would have a button to delete it permanently or import it, and also checkboxes, so multiple imports / deletes could be performed. There will also be a select element for filtering results by source. There will also be a screen plug in to actually view the import's details, which will work exactly the same as the current view plug in.

I also plan to extend this with a MePrints widget and e-mail notification, but I'm first looking to build the basic architecture and get it running.

I had a few questions about the above:

a) Is there any plan to implement anything similar to this already that I should be aware of before beginning?
b) How should I namespace the new plug in type (which should be automatically loaded by my script but not EPrints) and objects used by the cron job (which wouldn't really constitute plug ins, but rather helpers to that script) to avoid future conflicts?  
c) Is there anything I need to be aware of when creating a new table to store EPrints? In terms of scaling and keeping the main EPrints table relatively small I think it makes sense to keep the pre-import EPrints in their own table, but it will be essentially exactly the same as the current one. I'm also worried about duplicating code for listing and displaying individual EPrints. But I don't think they should be imported directly into the work area without the user's consent.

Thank you for your time and any help you can provide!

Best wishes, Rob

* Although I am aware some kind of pull imports would be performed much more efficiently through doing bulk imports for a given institution rather than searching on a per user basis. I think this would be better performed by an external script that updates a local cache - the pull-based import plug in would then query this cache.