
[EP-tech] Re: Extending EPrints with pull-based import plug ins

Hi Rob,

See comments below:

On 17/05/12 10:42, Berry, Rob wrote:
> Hey,
> This is a bit of an in-depth query, so I'm not sure whether I'm directing it to the right place. If I'm not, apologies in advance.
> I'm currently working on an extension to EPrints that will implement pull-based imports. The basic idea is that rather than have researchers at our institution have to manually search for documents they have uploaded to external repositories such as Web of Science or SciVerse, this should be performed automatically by the repository, alerting them when new documents have been found and allowing them to select which to import.
> I would implement this with a few separate components:
> 1. New kind of plug in for pull-based imports
>      The basic interface will take an EPrints user, find associated documents on an external source, convert them into EPrints, and return them to the calling script.* It should be easy to extend the system with new external sources this way.
Perhaps look at the OAI harvester on files.eprints.org; it does 
something similar, but for OAI.
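
To make the shape of the interface in point 1 concrete, here is a rough 
sketch. It is written in Python purely for illustration (EPrints plugins 
are Perl), and every name in it (PullImportSource, fetch_for_user, 
StaticSource, pull_all) is made up for the example rather than taken 
from EPrints.

```python
from abc import ABC, abstractmethod


class PullImportSource(ABC):
    """Hypothetical base class: one subclass per external source."""

    name = "unnamed"

    @abstractmethod
    def fetch_for_user(self, username):
        """Return a list of dict records found for this user."""


class StaticSource(PullImportSource):
    """Stand-in source backed by a dict, so the sketch runs offline."""

    def __init__(self, name, records_by_user):
        self.name = name
        self.records_by_user = records_by_user

    def fetch_for_user(self, username):
        # Tag each record with the source it came from.
        return [
            dict(record, source=self.name)
            for record in self.records_by_user.get(username, [])
        ]


def pull_all(sources, username):
    """Ask every registered source for this user's records."""
    found = []
    for source in sources:
        found.extend(source.fetch_for_user(username))
    return found


sources = [
    StaticSource("source_a", {"rob": [{"title": "Paper One"}]}),
    StaticSource("source_b", {"rob": [{"title": "Paper Two"}]}),
]
records = pull_all(sources, "rob")
```

Adding a new external source then means adding one more subclass; the 
calling script does not need to change.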

> 2. Script / daemon for performing the imports.
>      I'm initially going to implement this as a cron job that runs nightly, iterating through every user on the repository, and using each import component to build a list of prints. This will be checked against the local cache of imported prints to eliminate duplicates and also a list of prints that users on the system have 'deleted' - i.e. said they do not want to import or see any more.
> 3. Screen plug in for viewing outstanding imports.
>     This would work like the work area, displaying a paginated list of EPrints available for import into the work area. Each one would have a button to delete it permanently or import it, and also checkboxes, so multiple imports / deletes could be performed. There will also be a select element for filtering results by source. There will also be a screen plug in to actually view the import's details, which will work exactly the same as the current view plug in.
> I also plan to extend this with a MePrints widget and e-mail notification, but I'm first looking to build the basic architecture and get it running.
> I had a few questions about the above:
> a) Is there any plan to implement anything similar to this already that I should be aware of before beginning?
I don't think so. It's a neat idea though.

> b) How should I namespace the new plug in type (which should be automatically loaded by my script but not EPrints) and objects used by the cron job (which wouldn't really constitute plug ins, but rather helpers to that script) to avoid future conflicts?
I'd put them into /opt/eprints3/perl_lib/

> c) Is there anything I need to be aware of when creating a new table to store EPrints? In terms of scaling and keeping the main EPrints table relatively small I think it makes sense to keep the pre-import EPrints in their own table, but it will be essentially exactly the same as the current one. I'm also worried about duplicating code for listing and displaying individual EPrints. But I don't think they should be imported directly into the work area without the user's consent.
An eprint can be in one of four "states": "inbox" (User area), "buffer" 
(aka Review), "archive" (Live) and "deletion". Try adding an extra state, 
"imported", and copy the code in Screen::Review, adapting it to use your 
new state. This way everything's stored in the eprint table but is hidden 
from the rest of EPrints.

Add a few plugins to control the states (eg. move from "imported" to 
"inbox"), see Screen::EPrint::Remove for inspiration.
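
As a rough illustration of the idea (Python for brevity, not EPrints 
code; the EPrint class and the function names are made up for the 
example), the extra state just means the existing screens never select 
those records, and one small action moves a record on:

```python
# The four existing states, plus the proposed extra one.
CORE_STATES = {"inbox", "buffer", "archive", "deletion"}
IMPORTED = "imported"


class EPrint:
    """Hypothetical stand-in for a row in the eprint table."""

    def __init__(self, eprintid, title, state=IMPORTED):
        self.eprintid = eprintid
        self.title = title
        self.state = state


def visible_to_core(eprints):
    """What the existing screens would list: the four core states only."""
    return [e for e in eprints if e.state in CORE_STATES]


def pending_imports(eprints):
    """What the new import screen would list."""
    return [e for e in eprints if e.state == IMPORTED]


def accept_import(eprint):
    """Move a record from "imported" to "inbox" (the user's work area)."""
    if eprint.state != IMPORTED:
        raise ValueError("eprint %d is not awaiting import" % eprint.eprintid)
    eprint.state = "inbox"


table = [
    EPrint(1, "Already live", state="archive"),
    EPrint(2, "Found by the nightly job"),
]
accept_import(table[1])
```

Other moves out of "imported" (for example rejecting a record) would 
follow the same pattern.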

Also check the existing dataset 'import', which records an import, and the 
field 'importid' in the EPrint dataset, which links an eprint to an import.

If possible, share your code on files.eprints.org; I'm sure others will 
be interested.

Hope this helps,