EPrints Technical Mailing List Archive

Message: #00595



[EP-tech] Re: Extending EPrints with pull-based import plug ins


It's not the same approach, but this is what I've been working on:
https://docs.google.com/presentation/d/1FV54KEfhbsB2dhqsm6i_yz8dXdcavrK9WBVgC7YqZag/edit

The idea is to use a new dataset that stores serialised epdata (blobs).
That lets your search results be cached, waiting for the user to import
or ignore the entries they're interested in.

For my use case the cache is transitory: the data is deleted after 24
hours. Live items are tied back to search results via eprint.source
(the PMH tool is smarter: it checks datestamps, so it also supports
updates).
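
A rough sketch of that cache idea, in Python rather than EPrints' own Perl and not taken from the actual code. Everything named here (ImportCache, store, pending, take, the JSON serialisation, the in-memory dict standing in for a dataset) is an assumption made for illustration; only the 24-hour lifetime and the link through a source identifier come from the description above.

```python
import json
import time

TTL_SECONDS = 24 * 60 * 60  # cached results are discarded after 24 hours


class ImportCache:
    """Hypothetical in-memory stand-in for a dataset of serialised records."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be exercised in tests
        self.rows = {}      # source id -> (stored_at, serialised blob)

    def store(self, source, epdata):
        """Cache one search result as a serialised blob, keyed by its source id."""
        self.rows[source] = (self.clock(), json.dumps(epdata))

    def expire(self):
        """Drop every row older than the configured lifetime."""
        cutoff = self.clock() - self.ttl
        self.rows = {s: r for s, r in self.rows.items() if r[0] >= cutoff}

    def pending(self, live_sources):
        """Return cached entries not already tied to a live item via its source."""
        self.expire()
        return {
            s: json.loads(blob)
            for s, (_, blob) in self.rows.items()
            if s not in live_sources
        }

    def take(self, source):
        """Remove and return one entry, e.g. when the user chooses to import it."""
        _, blob = self.rows.pop(source)
        return json.loads(blob)
```

Ignoring an entry would simply leave it in place until it expires; a persistent "never show again" list would need separate storage, since this cache forgets everything after a day.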

/Tim.

On Mon, 2012-05-21 at 10:49 +1000, Mark Gregson wrote:
> Hi Bob
> 
> We were planning to work on something similar, starting sometime in the
> next few months, I think. We haven't done any serious analysis/design
> yet but it appears we're trying to do the same thing, i.e.,
> automatically initiating deposit using metadata from external sources
> and then pushing the stub records to the researchers to complete the
> deposit. I'll be keen to see what you do; perhaps we can build on it
> and contribute code back.
> 
> Cheers
> Mark
> 
> Mark Gregson | Application and Development Team Leader
> Library eServices | Queensland University of Technology
> Level 2 | I Block | Kelvin Grove Campus | GPO Box 2434 | Brisbane 4001
> Phone: +61 7 3138 3782 | Web: http://www.qut.edu.au/
> ABN: 83 791 724 622
> CRICOS No: 00213J
> 
> -----Original Message-----
> From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Berry, Rob
> Sent: Thursday, 17 May 2012 7:43 PM
> To: eprints-tech@ecs.soton.ac.uk
> Subject: [EP-tech] Extending EPrints with pull-based import plug ins
> 
> Hey,
> 
> This is a bit of an in-depth query, so I'm not sure whether I'm directing it to the right place. If I'm not, apologies in advance.
> 
> I'm currently working on an extension to EPrints that will implement pull-based imports. The basic idea is that, rather than making researchers at our institution search manually for documents they have uploaded to external repositories such as Web of Science or SciVerse, the repository should do this automatically, alerting them when new documents have been found and allowing them to select which to import.
> 
> I would implement this with a few separate components:
> 
> 1. New kind of plug in for pull-based imports
>     The basic interface will take an EPrints user, find associated documents on an external source, convert them into EPrints, and return them to the calling script.* It should be easy to extend the system with new external sources this way.
> 2. Script / daemon for performing the imports.
>     I'm initially going to implement this as a cron job that runs nightly, iterating through every user on the repository, and using each import component to build a list of prints. This will be checked against the local cache of imported prints to eliminate duplicates and also a list of prints that users on the system have 'deleted' - i.e. said they do not want to import or see any more.
> 3. Screen plug in for viewing outstanding imports.
>     This would work like the work area, displaying a paginated list of EPrints available for import into the work area. Each one would have a button to delete it permanently or import it, plus a checkbox so that multiple imports / deletes could be performed at once, and a select element would filter results by source. A second screen plug in would show an individual import's details, working exactly like the current view plug in.
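
The filtering in step 2 might look roughly like the sketch below. It is written in Python purely for illustration (EPrints itself is Perl), and every name in it is hypothetical: collect_candidates, the source_id key, and the idea that each source is a callable taking a user are assumptions, not part of any existing plug in interface.

```python
def collect_candidates(users, sources, already_imported, rejected):
    """Build the per-user list of records to offer for import.

    users            -- iterable of user identifiers
    sources          -- dict: source name -> callable(user) returning a list
                        of dicts, each with a 'source_id' key (assumed shape)
    already_imported -- set of source ids already held locally
    rejected         -- dict: user -> set of source ids that user 'deleted'
    """
    offers = {}
    for user in users:
        seen = set()   # source ids already offered to this user in this run
        fresh = []
        for name, fetch in sources.items():
            for record in fetch(user):
                sid = record["source_id"]
                if sid in already_imported:
                    continue  # duplicate of something already imported
                if sid in rejected.get(user, set()):
                    continue  # the user asked never to see this again
                if sid in seen:
                    continue  # two sources returned the same record
                seen.add(sid)
                fresh.append({**record, "source": name})
        if fresh:
            offers[user] = fresh
    return offers
```

Matching on a single identifier is the simplest possible duplicate check; real records from different sources rarely share one, so a production version would probably need fuzzier matching on DOI or title.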
> 
> I also plan to extend this with a MePrints widget and e-mail notification, but I'm first looking to build the basic architecture and get it running.
> 
> I had a few questions about the above:
> 
> a) Is there any plan to implement anything similar to this already that I should be aware of before beginning?
> b) How should I namespace the new plug in type (which should be automatically loaded by my script but not EPrints) and objects used by the cron job (which wouldn't really constitute plug ins, but rather helpers to that script) to avoid future conflicts?
> 
> c) Is there anything I need to be aware of when creating a new table to store EPrints? In terms of scaling and keeping the main EPrints table relatively small I think it makes sense to keep the pre-import EPrints in their own table, but it will be essentially exactly the same as the current one. I'm also worried about duplicating code for listing and displaying individual EPrints. But I don't think they should be imported directly into the work area without the user's consent.
> 
> Thank you for your time and any help you can provide!
> 
> Best wishes, Rob
> 
> * I am aware that some kinds of pull import would be much more efficient as bulk imports for a whole institution than as searches on a per-user basis. I think that would be better handled by an external script that updates a local cache, which the pull-based import plug in would then query.
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> 
