
[EP-tech] Multiple Uploaded Files in One Directory



Hi Alan and John,

Thank you both for the advice, it's really helpful.

I'll go with Alan's solution in the immediate term and John, if you have
the time, some more info on your solution would be brilliant.

From what I can see, this doesn't happen very often, but I'd much prefer if
it didn't happen at all!

Thanks,
James

On Wed, Aug 15, 2018 at 12:45 PM, John Salter <J.Salter at leeds.ac.uk> wrote:

> Hi James,
>
> Welcome to EPrints :o)
>
>
>
> When EPrints resolves a URL, it uses the eprintid and pos to get the
> document data object via
>
> EPrints::DataObj::Document::doc_with_eprintid_and_pos
>
>
>
> Normally there would only be one object returned - and the document that
> 'works' is the first one returned by the above call.
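To make the "first one returned" point concrete, here is a toy model in plain Perl. It does not call EPrints at all: the `@documents` rows and the `first_doc_at` helper are assumptions made up for illustration, standing in for the document table and the lookup by eprintid and pos.

```perl
use strict;
use warnings;

# Toy stand-in for the document table: two rows share eprintid=1, pos=1.
my @documents = (
    { docid => 10, eprintid => 1, pos => 1, main => 'A.pdf' },
    { docid => 11, eprintid => 1, pos => 1, main => 'B.pdf' },
    { docid => 12, eprintid => 1, pos => 2, main => 'C.pdf' },
);

# Hypothetical helper (not an EPrints method): return the first row that
# matches (eprintid, pos), mirroring "the first one returned wins".
sub first_doc_at
{
    my( $eprintid, $pos ) = @_;
    foreach my $doc ( @documents )
    {
        return $doc if $doc->{eprintid} == $eprintid && $doc->{pos} == $pos;
    }
    return undef;
}

my $doc = first_doc_at( 1, 1 );
print "Resolved eprint 1, pos 1 to docid $doc->{docid} ($doc->{main})\n";
```

In this model docid 11 (B.pdf) can never be reached through eprint 1, pos 1, because the lookup always stops at docid 10.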
>
>
>
> Onto the question about how items get into this state:
>
> This sounds very similar to an issue we had with our Symplectic connector
> - and how it merged two EPrints together when the corresponding Symplectic
> items were merged together. This ends up with two documents attached to the
> same EPrint existing in the same 'pos'.
>
>
>
> EPrints' default behaviour is to remove the 'pos' during a clone *only*
> when the doc is being cloned to the same parent:
> https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/DataObj/Document.pm#L374
>
>
>
> In some circumstances, this is not the correct course of action - EPrints
> should check that a doc doesn't already exist at that pos for that eprint.
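The check described above could look roughly like the following. This is a minimal sketch in plain Perl, not EPrints code: `pos_for_clone` is a hypothetical helper, and it assumes the caller already has the list of pos values in use on the target eprint. Returning undef stands for "let the clone get a new position".

```perl
use strict;
use warnings;

# Hypothetical helper: decide which pos a cloned document should carry.
# $target_positions is an arrayref of pos values already used on the
# target eprint; $source_pos is the pos of the document being cloned.
sub pos_for_clone
{
    my( $target_positions, $source_pos ) = @_;

    my %taken = map { $_ => 1 } @{$target_positions};

    # Keep the original pos only if nothing on the target occupies it.
    return $source_pos unless $taken{$source_pos};

    # Otherwise clear it so that a fresh position is assigned.
    return undef;
}

my $pos = pos_for_clone( [ 1, 2 ], 1 );
print defined $pos ? "keep pos $pos\n" : "pos clash: reset pos\n";
```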
>
>
>
> I flagged the issue to Symplectic - the ticket reads:
>
> #################
>
> We've discovered an issue with the Elements/EPrints connector:
>
> EPrint ID 1; document: A.pdf with pos=1.
>
> EPrint ID 2; document: B.pdf with pos=1.
>
>
>
> If both of these are attached to Elements records, which are then merged,
> the resulting EPrint ends up with two documents at pos=1.
>
> This is not meant to happen, and will mean that one of the documents is
> unreachable.
>
>
>
> The 'real' bug lies in EPrints - but the connector 'tickles' it when two
> records are merged - and the $document->clone() method is used (which
> possibly should be flagged as an 'internal' EPrints method).
>
> #################
>
>
>
> I've created a fix for the Symplectic connector - and submitted it to them
> for review/release as a new version of RT1.
>
> As yet this hasn't been released.
>
>
>
> The specific fix I have for the Symplectic connector is below (also saved as
> https://gist.github.com/jesusbagpuss/d9e292bd4dd222f5199a36747989f708 in
> case the code gets mangled by email transport):
>
>
>
> ############################################################
> # Based on EPrints::DataObj::Document::clone
> # NB Code duplication with Symplectic::RepoProcess::MergeManager
> #
> # Cloning documents can result in:
> # - two documents with the same 'pos' field - and therefore sharing the
> #   same folder
> # - 'spaces' in the document structure (e.g. pos=1 and pos=3, but no pos=2)
> # this isn't what is needed. The code below manages these scenarios.
> # EPrints' default behaviour is to remove the 'pos' during a clone *only*
> # when the doc is being cloned to the same parent.
> sub clone_document
> {
>         my ($self, %args ) = @_;
>         my $eprint = $args{'eprint'};
>         my $doc = $args{'doc'};
>         my $reset_pos = $args{'reset_pos'};
>
>         my $data = EPrints::Utils::clone( $doc->{data} );
>
>         # cloning within the same eprint, in which case get a new position!
>         #if( defined $doc->parent && $eprint->id eq $doc->parent->id )
>         if( ( defined $doc->parent && $eprint->id eq $doc->parent->id ) || $reset_pos )
>         {
>                 $data->{pos} = undef;
>         }
>
>         $data->{eprintid} = $eprint->get_id;
>         $data->{_parent} = $eprint;
>
>         # First create a new doc object
>         my $new_doc = $doc->{dataset}->create_object( $doc->{session}, $data );
>         return undef if !defined $new_doc;
>
>         my $ok = 1;
>
>         # Copy files
>         foreach my $file (@{$doc->get_value( "files" )})
>         {
>                 $file->clone( $new_doc ) or $ok = 0, last;
>         }
>
>         if( !$ok )
>         {
>                 $new_doc->remove();
>                 return undef;
>         }
>
>         return $new_doc;
> }
> ############################################################
>
>
>
> NB There are also some other changes required in the Symplectic connector
> to make this work. If you'd like more information about this fix, let me
> know!
>
>
>
> If you want to know how many items in your repository are affected by the
> 'duplicated pos' issue, you can run the following SQL on the database:
>
> SELECT
>   eprintid, pos, count(*) as c
> FROM
>   document
> GROUP BY
>   eprintid, pos
> HAVING c > 1;
>
>
>
> If there are a few items, you may be able to resolve them by human effort.
>
> If there are lots, then some scripting might be needed?
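As a starting point for that kind of script, here is a sketch in plain Perl that only plans the renumbering. It assumes the affected rows (docid, eprintid, pos) have already been read from the database, and `plan_pos_fixes` is a hypothetical helper, not part of EPrints. It does not touch the database or move any files on disk; both would still be needed to apply the plan.

```perl
use strict;
use warnings;

# Hypothetical helper: given document rows (docid, eprintid, pos), return a
# list of [docid, old_pos, new_pos] moves that would make every
# (eprintid, pos) pair unique. The lowest docid at a pos keeps that pos.
sub plan_pos_fixes
{
    my( @rows ) = @_;

    # Record every pos already in use, per eprint.
    my %used;
    $used{ $_->{eprintid} }{ $_->{pos} } = 1 for @rows;

    my( %seen, @moves );
    foreach my $row ( sort { $a->{docid} <=> $b->{docid} } @rows )
    {
        my( $id, $pos ) = @{$row}{qw( eprintid pos )};
        next unless $seen{$id}{$pos}++;    # first holder keeps its pos

        # Duplicate: take the lowest pos not yet used on this eprint.
        my $new = 1;
        $new++ while $used{$id}{$new};
        $used{$id}{$new} = 1;
        push @moves, [ $row->{docid}, $pos, $new ];
    }
    return @moves;
}

my @moves = plan_pos_fixes(
    { docid => 10, eprintid => 1, pos => 1 },
    { docid => 11, eprintid => 1, pos => 1 },
    { docid => 12, eprintid => 1, pos => 3 },
);
printf "docid %d: pos %d -> %d\n", @$_ for @moves;
```

Which document keeps the contested pos is a design choice; the sketch keeps the lowest docid, but for a real repository you may prefer whichever document is currently reachable through its URL.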
>
>
>
> Does that help at all?
>
> Cheers,
>
> John
>
>
>
>
>
> *From:* eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces@
> ecs.soton.ac.uk] *On Behalf Of *James Kerwin
> *Sent:* 15 August 2018 10:20
> *To:* eprints-tech at ecs.soton.ac.uk
> *Subject:* [EP-tech] Multiple Uploaded Files in One Directory
>
>
>
> Morning all,
>
>
>
> I'm very new to the world of EPrints and I'm still getting to grips with
> it.
>
>
>
> I was alerted to a problem today where a file uploaded to EPrints is
> giving a "404 File not Found" warning when attempting to view/download the
> document.
>
>
>
> On the repository server the document is present but appears in the same
> directory as another document (which can be accessed through EPrints).
> There is then a third document in a second directory that can be accessed.
>
>
>
> Looking in the database I can see that all three documents are public and
> should be accessible.
>
>
>
> As I understand it, the URL matches the file structure as:
>
>
>
> www.[repository_name].com/[EPrintID]/[DocPos]/[DocName]
>
>
>
> And on the server the files are stored somewhere in the EPrints directory as:
>
>
>
> [EP/ri/nt/sI/d]/DocPos/document.pdf
>
>
>
> As in a one-to-one between DocPos and doc name (I've looked at some other
> examples with more than 2 documents in one EPrint and each one follows this
> so far).
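The layout described above can be sketched in plain Perl. This assumes the common arrangement where the eprint ID is zero-padded to eight digits and split into pairs, and the pos is a two-digit directory; the storage root and the exact scheme depend on the installation, and `sketch_doc_path` is a hypothetical helper, not an EPrints function.

```perl
use strict;
use warnings;

# Hypothetical helper: build the relative path for a document file,
# assuming an 8-digit, pair-split eprint ID and a 2-digit pos directory.
sub sketch_doc_path
{
    my( $eprintid, $pos, $filename ) = @_;

    my $padded  = sprintf( "%08d", $eprintid );
    my $id_path = join( "/", $padded =~ /(\d\d)/g );

    return sprintf( "%s/%02d/%s", $id_path, $pos, $filename );
}

print sketch_doc_path( 1234, 1, "document.pdf" ), "\n";
# prints 00/00/12/34/01/document.pdf
```

Under this assumption, two documents with the same eprint ID and the same pos map to the same directory, which matches the two files found side by side on the server.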
>
>
>
> Firstly, are my assumptions correct?
>
> Has anybody had a similar thing happen before?
>
>
>
> Thanks,
>
> James
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
>
>