eprints_tech messages
[EP-tech] Other little things...
From: ePrints Support <support AT eprints.org>
Date: Mon, 9 Apr 2001 10:45:29 +0100
I've decided to explicitly support ISO-LATIN-1. I know that OAI uses
UTF-8, but if I don't limit my scope of features I'll never finish.

I'm thinking of redesigning the web-based registration. Well, in fact
I'm going to, but I'm thinking about how. I'm thinking of having a page
which asks you to enter your email address; the system then generates a
random password for that address, sets the username to be the SAME as
the email address and mails that person their password.

This will close the can of worms which is the process_mail script,
which I think has been the single biggest support problem.

This also adds a neat feature: if you have "local" and "signup" users,
eg. members of your dept. and just random people off the web, then you
know that the "signup" users will have an "@" in their username. That
means you can never have a problem I've been worried about, which is
this. Our department allocates usernames to its users, and we want them
to be able to use the same username/password as for email and logins.
Problem: someone signs up on the web for a username which is free, but
then the dept. admin allocates that username internally 3 months later.
Not being able to give a member of our dept. a username because someone
on the eprints system is already using it is a case of the tail wagging
the dog. Hence the email address solution above is a nice solution.

This doesn't mean local users couldn't all log in via email address
too - email addresses have the nice feature of being unique (well, most
of the time).

I'm toying with removing process_mail from the next minor release of
eprints 1.1 as it's causing a lot of pain - I would make the "username"
requested be the email address minus everything after the "@".

Would people think that this is great or the work of the devil?

For a demonstration of how this system would work, the following web
"service" has been running for some time without serious problem; feel
free to "poke it" if you want to see how the web-based signup would
work (ish): http://totl.net/VCash/
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
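
The signup flow proposed above (username is the email address, password is generated and mailed out) could be sketched roughly as follows. This is an illustrative Python sketch, not EPrints code: the function names, the in-memory `users` dict and the 12-character password length are all assumptions, and the mailing step is left out.

```python
import secrets
import string


def register_by_email(email, users):
    """Create a "signup" account whose username is the email address.

    `users` is a hypothetical in-memory store (username -> password);
    a real system would store a password hash and send the mail.
    """
    email = email.strip()
    # Very rough plausibility check: exactly one "@", not at either end.
    if email.count("@") != 1 or email.startswith("@") or email.endswith("@"):
        raise ValueError("not a plausible email address")
    if email in users:
        raise ValueError("address already registered")
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(12))
    users[email] = password
    return email, password


def is_signup_user(username):
    """Signup usernames contain "@"; locally allocated ones never do."""
    return "@" in username
```

Because locally allocated usernames never contain "@", the two ranges of usernames cannot collide, which is the property the message relies on.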
[EP-tech] Slowly but surely
From: ePrints Support <support AT eprints.org>
Date: Mon, 9 Apr 2001 10:32:03 +0100
Latest things I've been working on: XHTML, stylesheets & DOM.

I've been changing the internals of eprints so that rather than
generating a webpage by doing:

    print "<P>";
    print "My Cat";
    print "</P>";

the system now builds the entire page as an XML tree, then prints it:

    use XML::DOM;
    $page = new XML::DOM::Document;
    $p = $page->createElement( "p" );
    $cat = $page->createTextNode( "My Cat" );
    $p->appendChild( $cat );
    $page->appendChild( $p );

OK, that doesn't produce a legal page (no <html> etc.) but you get the
idea.

XHTML is, in a nutshell, HTML represented in XML (which can still be
parsed by current browsers). The practical differences are that all
element and attribute names must be lower case, and that ALL elements
must be closed. Eg. <br></br> which can be abbreviated to: <br />

The template for pages in the config module must be in XHTML, as the
system parses it into a tree when it starts up (not on every page
request). This is slightly harder, but means that the system will
always produce well formed pages. Which is nice.

I've been removing as much markup as I can from the HTML (XHTML now!)
produced by eprints, and replacing it with class="foo" attributes on
each part, so that the admin can control the look of the generated
pages without having to hack at the code.

I've changed the way "citation configuration" works - this is the way
in which a record is rendered into a single string, for search result
pages etc. The new system is, shock!, XML - or more accurately an XHTML
fragment with two extra elements: <IF> and <FIELD> (note these *are*
uppercase so they are clearly distinct from the XHTML elements). Eg.
for conf paper:

    <FIELD name="authors"/> <IF name="year">(<FIELD name="year"/>) </IF>
    <FIELD name="title"/>. In <IF name="editors"><FIELD name="editors"/>,
    Eds. </IF><IF name="conference"><i>Proceedings <FIELD
    name="conference" /></i></IF><IF name="volume"> <b><FIELD
    name="volume" /></b></IF><IF name="number"> (<FIELD name="number"
    />)</IF><IF name="pages">, pages <FIELD name="pages" /> </IF><IF
    name="confloc">, <FIELD name="confloc" /></IF>.

I know this is more verbose but it allows control of exactly what you
want, rather than the old system which was yet another config system.

When rendering, the system first deals with the <IF> elements: each one
is replaced by its contents IF the named field is not empty in this
record, and removed along with its contents otherwise. Then the system
replaces <FIELD> tags with the value of the named field.

I'm considering doing something similar with the language configuration
files, using XML.
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
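
The two-pass rendering described in this message (resolve <IF>, then substitute <FIELD>) can be illustrated with a small Python sketch. It is only an approximation: the message says EPrints parses the template as XML, whereas this sketch uses regular expressions and assumes <IF> elements are never nested. The function name and the record dict are hypothetical.

```python
import re

# Assumption: <IF> elements are not nested, so a non-greedy match is enough.
IF_RE = re.compile(r'<IF\s+name="([^"]+)"\s*>(.*?)</IF>', re.DOTALL)
FIELD_RE = re.compile(r'<FIELD\s+name="([^"]+)"\s*/>')


def render_citation(template, record):
    """Render a citation template against a dict of field values."""
    def keep_if_set(match):
        # Keep the contents only when the named field is non-empty.
        return match.group(2) if record.get(match.group(1)) else ""

    text = IF_RE.sub(keep_if_set, template)
    # Second pass: replace each <FIELD> with the field's value.
    return FIELD_RE.sub(lambda m: str(record.get(m.group(1), "")), text)
```

For example, with the template `<FIELD name="authors"/> <IF name="year">(<FIELD name="year"/>) </IF><FIELD name="title"/>.`, a record without a year renders without the parentheses, and one with a year includes them.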
Re: [EP-tech] Problems with OO Design
From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Mon, 26 Mar 2001 15:10:59 -0500
This is good to hear. The more I learn about the next release, the
more excited I get about it. Who ever said serious software can't be
cool? :)
--pc
On Fri, Mar 23, 2001 at 12:41:58PM +0000, ePrints Support wrote:
> Although under normal circumstances, I'd take a patch and say thank you,
> the current development version of EPrints looks VERY different.
>
> I've been slowly phasing out variables like the one you mention below,
> I'll make even more of an effort now.
>
> I would like to make a read-only version of the CVS available at some
> point but at the moment there's a lot of 'scaffold' to make things work
> as I rewrite various sections. I'd spend all my time helping people
> get it working.
>
> One neat little change is that it will load all the modules when apache
> starts, NOT each time it spawns a sub process.
>
> Also SiteInfo, SiteRoutine etc are going to be rolled into one module,
> which will be named after the id of your site eg.
> EPrintSite/cogprints.pm
>
> This module will provide an object which will represent the site configuration
> including references to methods for validation, rendering etc.
>
> It decides what config module to load based on the host and path of the URL
> request and does some reflection, which perl is good at.
>
>
> On Thu, Mar 22, 2001 at 03:26:10PM -0500, Clayton Carter wrote:
>
--
Clayton Carter crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."
Re: [EP-tech] Problems with OO Design
From: ePrints Support <support AT eprints.org>
Date: Fri, 23 Mar 2001 16:55:03 +0000
Enough already :-)

I've set up a nightly script which knocks together a tar file of the
code. This is available from http://www.ecs.soton.ac.uk/~cjg/eprints/

For those of you not familiar with development code, don't be surprised
if it (A) doesn't work and (B) isn't quite as well commented as the
release version...

On Fri, Mar 23, 2001 at 02:39:10PM -0000, Tim Brody wrote:
> > I would like to make a read-only version of the CVS available at some
> > point but at the moment there's a lot of 'scaffold' to make things work
> > as I rewrite various sections. I'd spend all my time helping people
> > get it working.
>
> Awww ... you just want to keep it to yourself don't you?
>
> Seriously, does it really matter? After all, compiling from CVS is only for
> developers, if you want a working system you should download a release
> version...
>
> (One of the reasons I've not got around to looking at ePrint internals is
> because it would involve downloading/installing, source code access via CVS
> obviates that need)
>
> Regards,
> Tim.
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
Re: [EP-tech] Problems with OO Design
From: "Tim Brody" <tdb198 AT soton.ac.uk>
Date: Fri, 23 Mar 2001 14:39:10 -0000
> I would like to make a read-only version of the CVS available at some
> point but at the moment there's a lot of 'scaffold' to make things work
> as I rewrite various sections. I'd spend all my time helping people
> get it working.

Awww ... you just want to keep it to yourself don't you?

Seriously, does it really matter? After all, compiling from CVS is only
for developers, if you want a working system you should download a
release version...

(One of the reasons I've not got around to looking at ePrint internals
is because it would involve downloading/installing, source code access
via CVS obviates that need)

Regards,
Tim.
Re: [EP-tech] Problems with OO Design
From: ePrints Support <support AT eprints.org>
Date: Fri, 23 Mar 2001 12:41:58 +0000
Although under normal circumstances, I'd take a patch and say thank you,
the current development version of EPrints looks VERY different.
I've been slowly phasing out variables like the one you mention below,
I'll make even more of an effort now.
I would like to make a read-only version of the CVS available at some
point but at the moment there's a lot of 'scaffold' to make things work
as I rewrite various sections. I'd spend all my time helping people
get it working.
One neat little change is that it will load all the modules when apache
starts, NOT each time it spawns a sub process.
Also SiteInfo, SiteRoutine etc are going to be rolled into one module,
which will be named after the id of your site eg.
EPrintSite/cogprints.pm
This module will provide an object which will represent the site configuration
including references to methods for validation, rendering etc.
It decides what config module to load based on the host and path of the URL
request and does some reflection, which perl is good at.
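
As a rough illustration of choosing a site configuration from the host and path of a request, here is a minimal Python sketch. Everything in it is hypothetical (the host names, path prefixes, the `SITES` table and the function name); the message only says that the site id selects a module such as EPrintSite/cogprints.pm, and does not show how the lookup is done.

```python
# Hypothetical routing table: (host, path prefix, site id).
SITES = [
    ("eprints.example.org", "/cogprints/", "cogprints"),
    ("eprints.example.org", "/", "main"),
]


def site_id_for(host, path):
    """Return the site id whose host matches and whose path prefix is
    the longest one matching the request path."""
    matches = [
        (len(prefix), site_id)
        for site_host, prefix, site_id in SITES
        if site_host == host.lower() and path.startswith(prefix)
    ]
    if not matches:
        raise LookupError("no site configured for %s%s" % (host, path))
    return max(matches)[1]
```

The returned id would then name the configuration module to load; loading it once at server start rather than per request is the change the message describes.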
On Thu, Mar 22, 2001 at 03:26:10PM -0500, Clayton Carter wrote:
> Thanks for the note about the indexed searching. That's great
> news and I'm glad to hear it.
>
> I've now a request/complaint/rambling string of sentences. We
> (at the Indiana University Digital Library Program) are really pleased
> with EPrints but, as usual, there are things we've needed to change or
> tweak. I've accomplished this in a very OO way (and a rather -- to my
> senses, at least -- elegant way) as such:
>
> EPrints::SubmissionForm::update_from_subject_form() needed to
> be changed. (Well, maybe it didn't *need* to be changed, but this is
> the solution I came up with.)
>
> In order to do this, I moved SubmissionForm.pm to
> SubmissionFormTheirs.pm (perhaps `EP' or `Orig' would have been a
> better name) and did a global search replace in that file to change
> all occurrences of `EPrints::SubmissionForm' to
> `EPrints::SubmissionFormTheirs'. I then made a symlink from
> `perl_lib/EPrints/SubmissionForm.pm' to
> `perl_lib/DLIB/SubmissionForm.pm' to help us keep all of our changes
> in one place. `DLIB/SubmissionForm.pm' looked something like this:
>
> package EPrints::SubmissionForm;
>
> ...
>
> use base qw(EPrints::SubmissionFormTheirs); # combined use and @ISA
>
> ...
>
> sub update_from_subject_form
> {
> ...
> }
>
> 1;
>
> This worked great, except for one problem. A few of the
> scripts, like `cgi/staff/view_submission', referred directly to class
> variables stored in the old `EPrints::SubmissionForm' (now
> `EPrints::SubmissionFormTheirs'). The script broke because of the
> fact that my inheriting package didn't have the same globals defined.
> (In particular, the access to the `action_*' variables is what caused
> the problems.) The short solution was to just copy all of the
> `action_*' and `stage_*' variables into my package.
>
> Keep in mind that, while OO is -- for the most part -- old hat
> to me, I'm still getting used to Perl's OO stuff. Basically, I'm
> wondering if it's at all on the plate to modify things like this for a
> more OO friendly interface. And I mean this in the simplest of ways:
> providing a method which takes an action_type and returns the
> associated text.
>
> I guess this isn't a big deal, but it would make the system
> that much easier to extend. I don't think that this particular
> example would be that hard to implement, so if I put together a patch
> to make this change, what format would be preferred? (I'm none too
> well acquainted with unified diffs.)
>
> --pc
>
> --
> Clayton Carter crcarter AT cs.indiana.edu
> "My mom says I'm the handsomest guy in school."
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
[EP-tech] Problems with OO Design
From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Thu, 22 Mar 2001 15:26:10 -0500
Thanks for the note about the indexed searching. That's great
news and I'm glad to hear it.
I've now a request/complaint/rambling string of sentences. We
(at the Indiana University Digital Library Program) are really pleased
with EPrints but, as usual, there are things we've needed to change or
tweak. I've accomplished this in a very OO way (and a rather -- to my
senses, at least -- elegant way) as such:
EPrints::SubmissionForm::update_from_subject_form() needed to
be changed. (Well, maybe it didn't *need* to be changed, but this is
the solution I came up with.)
In order to do this, I moved SubmissionForm.pm to
SubmissionFormTheirs.pm (perhaps `EP' or `Orig' would have been a
better name) and did a global search replace in that file to change
all occurrences of `EPrints::SubmissionForm' to
`EPrints::SubmissionFormTheirs'. I then made a symlink from
`perl_lib/EPrints/SubmissionForm.pm' to
`perl_lib/DLIB/SubmissionForm.pm' to help us keep all of our changes
in one place. `DLIB/SubmissionForm.pm' looked something like this:
    package EPrints::SubmissionForm;
    ...
    use base qw(EPrints::SubmissionFormTheirs); # combined use and @ISA
    ...
    sub update_from_subject_form
    {
        ...
    }
    1;
This worked great, except for one problem. A few of the
scripts, like `cgi/staff/view_submission', referred directly to class
variables stored in the old `EPrints::SubmissionForm' (now
`EPrints::SubmissionFormTheirs'). The script broke because of the
fact that my inheriting package didn't have the same globals defined.
(In particular, the access to the `action_*' variables is what caused
the problems.) The short solution was to just copy all of the
`action_*' and `stage_*' variables into my package.
Keep in mind that, while OO is -- for the most part -- old hat
to me, I'm still getting used to Perl's OO stuff. Basically, I'm
wondering if it's at all on the plate to modify things like this for a
more OO friendly interface. And I mean this in the simplest of ways:
providing a method which takes an action_type and returns the
associated text.
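
The kind of accessor being asked for might look like the following Python sketch. It illustrates the general point rather than EPrints itself: package variables are not found through inheritance in Perl, but method calls are, so a lookup method lets a subclass reuse the labels without copying them. The class names, the `action_text` method and the label values are all hypothetical; the thread does not show the real `action_*` values.

```python
class SubmissionForm:
    # Hypothetical labels; stand-ins for the action_* globals.
    _ACTION_TEXT = {"next": "Next >", "prev": "< Back"}

    @classmethod
    def action_text(cls, action_type):
        """Return the display text associated with an action type."""
        return cls._ACTION_TEXT[action_type]


class LocalSubmissionForm(SubmissionForm):
    """Site-specific subclass: overrides one method, and gets the
    labels through the inherited accessor."""

    def update_from_subject_form(self):
        return "site-specific behaviour"
```

A script that calls `LocalSubmissionForm.action_text("next")` keeps working when the class is swapped for a subclass, which is the breakage described above.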
I guess this isn't a big deal, but it would make the system
that much easier to extend. I don't think that this particular
example would be that hard to implement, so if I put together a patch
to make this change, what format would be preferred? (I'm none too
well acquainted with unified diffs.)
--pc
--
Clayton Carter crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."
Re: [EP-tech] ePrints and LDAP Authentication
From: ePrints Support <support AT eprints.org>
Date: Thu, 22 Mar 2001 11:18:51 +0000
I've been working on the ground work for this, this morning.

We have a similar situation - I want local users to use their encrypted
password and UNIX username, but I still want external users to be able
to sign up for subscriptions.

I'm currently thinking that in this situation, non-"local" users will
be forced to have a prefix to their userid (eg "@") to make sure they
don't cross into the same range as the local users.

I'm thinking of being able to have N different kinds of users, rather
than the current "Staff" and "User". Each user type will have
permission to use various features:

  basic user stuff (password change, set info)
  subscriptions
  deposit papers
  staff searches & siteinfo
  editorial (approving/rejecting items)
  admin - modifying other users etc.

and each type of user will have their own authentication method,
provided as a callback to a mod_perl authentication handler.

For my example, I would have "Staff", "Local" and "Guest" types with
different privs. in the system. All 3 would use the
Apache::AuthDBI::authen method but "Guest" would set the encrypted
password to "off" and "Staff" and "Local" would set encrypted to "on".

This section would also configure if a usertype was allowed to set
their password or not - it doesn't make sense for users who are having
their eprints account initialised from an external source to be able to
change their password.

How does this apply to LDAP? If all goes to plan, your local users can
use the perl LDAP authentication module instead, and totally ignore the
password field in eprints, although you will still need to dump info
into the database from the LDAP system for things like their name,
email etc.

This is all in development and suggestions, feedback, abuse etc are
welcome.

On Tue, Mar 20, 2001 at 04:17:52PM -0000, William Nixon wrote:
> Chris hi,
>
> Thanks for the very useful overview of ePrints, including the back history.
> Glad that you have garnered so much interest.
>
> Here at the University of Glasgow we have installed ePrints v.1.0 and are
> looking at the possibilities which it offers us.
>
> One [currently missing] element which we are very interested in though is
> the possibility of ePrints using the directory protocol LDAP to authenticate
> users via our existing user databases on campus. Either that or other
> alternatives which would allow them to use existing usernames and passwords
> rather than creating new ones - and so adding to the very long list of
> logins which they already use.
>
> Is this something which other institutions would also be interested, how
> critical do they see the issue of authentication? How do they currently deal
> with their ePrints authentication?
>
> William J Nixon
>
> ==
> William J Nixon, Deputy Head of IT Services / Project Co-ordinator
> Glasgow University Library, Hillhead Street, Glasgow, G12 8QE, Scotland, UK
> e-mail: w.j.nixon AT lib.gla.ac.uk www:
> http://www.lib.gla.ac.uk/staff/wnixon
> tel: +44 (0)141 330 6721 fax: +44 (0)141 330 4952
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
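
The per-user-type configuration sketched in this reply could be modelled along these lines. This is a Python illustration under stated assumptions: the permission names and which type gets which permission are invented for the example (the message lists the features but not the assignment), and the authentication callback is reduced to two flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserType:
    name: str
    permissions: frozenset   # features this type may use
    encrypted_password: bool  # "on"/"off" in the message
    can_set_password: bool    # False when the account comes from outside


# Hypothetical assignment of the listed features to the three types.
USER_TYPES = {
    "Guest": UserType("Guest", frozenset({"basic", "subscriptions"}),
                      encrypted_password=False, can_set_password=True),
    "Local": UserType("Local",
                      frozenset({"basic", "subscriptions", "deposit"}),
                      encrypted_password=True, can_set_password=False),
    "Staff": UserType("Staff",
                      frozenset({"basic", "subscriptions", "deposit",
                                 "staff_search", "editorial", "admin"}),
                      encrypted_password=True, can_set_password=False),
}


def allowed(type_name, permission):
    """True if users of the named type may use the given feature."""
    return permission in USER_TYPES[type_name].permissions
```

Swapping in LDAP for local users would then mean changing how the "Local" and "Staff" types authenticate, without touching the permission sets.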
Re: [EP-tech] Accented Characters
From: ePrints Support <support AT eprints.org>
Date: Tue, 20 Mar 2001 21:07:26 +0000
aha. The new version I'm working on keeps an index of words which
are in each record.
To improve matching it removes 's' from the end of an index word
and some other bits and bobs. When you search for some words it
then uses the same metrics. Unless you ask for an exact search, a
search for "beasts" will look up "beast" in the word index.
How is this related? When I put things into the word index I'll
remove the accents, so that a record containing 'Gödel'
will appear in searches for 'Gödel' and for 'Godel' (and Godels
for that matter).
In fact the routine which decides which words should be indexed,
and as what, is in the SiteRoutines so that you can tweak it to
the needs of your archive.
Here's the current version, it'll probably change a bit before I'm
done. For example I plan to remove all accents before indexing the
words.
Sound good?
######################################################################
#
# extract_words( $text )
#
# This method is used when indexing a record, to decide what words
# should be used as index words.
# It is also used to decide which words to use when performing a
# search.
#
# It returns references to 2 arrays, one of "good" words which should
# be used, and one of "bad" words which should not.
#
######################################################################
sub extract_words
{
    my( $text ) = @_;

    # Remove single quotes so "don't" becomes "dont"
    $text =~ s/'//g;

    # Normalise acronyms eg.
    # The F.B.I. is like M.I.5.
    # becomes
    # The FBI is like MI5
    my $a;
    $text =~ s#[A-Z0-9]\.([A-Z0-9]\.)+#$a=$&;$a=~s/\.//g;$a#ge;

    # Remove hyphens from acronyms
    $text =~ s#[A-Z]-[A-Z](-[A-Z])*#$a=$&;$a=~s/-//g;$a#ge;

    # Replace any non alphanumeric characters with a space instead
    $text =~ s/[^a-zA-Z0-9]/ /g;

    # Iterate over every word (space separated values)
    my @words = split /\s+/ , $text;

    # We use hashes rather than arrays at this point to make
    # sure we only get each word once, not once for each occurrence.
    my %good = ();
    my %bad = ();
    foreach( @words )
    {
        # skip if this is nothing but whitespace;
        next if /^\s*$/;

        # calculate the length of this word
        my $wordlen = length $_;

        # $ok indicates if we should index this word or not

        # First approximation is if this word is over or equal
        # to the minimum size set in SiteInfo.
        my $ok = $wordlen >= $EPrintSite::SiteInfo::freetext_min_word_size;

        # If this word is at least 2 chars long and all capitals
        # it is assumed to be an acronym and thus should be indexed.
        if( m/^[A-Z][A-Z0-9]+$/ )
        {
            $ok = 1;
        }

        # Consult list of "never words". Words which should never
        # be indexed.
        if( $EPrintSite::SiteInfo::freetext_never_words{lc $_} )
        {
            $ok = 0;
        }

        # Consult list of "always words". Words which should always
        # be indexed.
        if( $EPrintSite::SiteInfo::freetext_always_words{lc $_} )
        {
            $ok = 1;
        }

        # Add this word to the good list or the bad list
        # as appropriate.
        if( $ok )
        {
            # Only "bad" words are used in display to the
            # user. Good words can be normalised even further.

            # non-acronyms (ie not all UPPERCASE words) have
            # a trailing 's' removed. Thus in searches the
            # word "chair" will match "chairs" and vice-versa.
            # This isn't perfect "mose" will match "moses" and
            # "nappy" still won't match "nappies" but it's a
            # reasonable attempt.
            s/s$//;

            # If any of the characters are lowercase then lower
            # case the entire word so "Mesh" becomes "mesh" but
            # "HTTP" remains "HTTP".
            if( m/[a-z]/ )
            {
                $_ = lc $_;
            }

            $good{$_}++;
        }
        else
        {
            $bad{$_}++;
        }
    }

    # convert hash keys to arrays and return references
    # to these arrays.
    my( @g ) = keys %good;
    my( @b ) = keys %bad;
    return( \@g , \@b );
}
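
The plan mentioned above, removing accents before indexing so that 'Gödel' and 'Godel' meet at the same index word, might be sketched like this in Python. It is an illustration, not the EPrints routine: it uses Unicode decomposition (NFD) to drop combining marks, which is one possible technique and not necessarily what the author had in mind, and it does not fold letters that have no decomposition (for example "ø" or "ß"). The function names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


def index_key(word):
    """Normalise a word the same way at index time and search time:
    strip accents, drop one trailing 's', and lower-case the word
    unless it is entirely upper case (an acronym)."""
    word = strip_accents(word)
    if word.endswith("s"):
        word = word[:-1]
    if any(ch.islower() for ch in word):
        word = word.lower()
    return word
```

Because the same function is applied when indexing and when searching, a search for "Godels" and a record containing "Gödel" both reduce to the key "godel".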
On Tue, Mar 20, 2001 at 02:39:58PM -0500, Clayton Carter wrote:
> Hi All,
>
> I've a few questions and/or concerns related to diacritics.
> Primarily, I'm concerned with how they're handled for the database,
> search and display mechanisms.
>
> Currently, it looks as though the characters put into the HTML
> form are copied directly into the database and then displayed `as is'
> in the HTML. I say `as is' because they aren't translated into any
> funny looking HTML escape sequences like &Ouml;. Also, when
> searching for words containing accents, one must specify the exact
> character sequence (complete with accents) in order to score a hit.
>
> I can certainly see how/why this is considered correct
> behavior, but I was wondering if there was any more mature
> implementation planned? I've been told that this is some sort of
> outstanding problem in library databases, but that the usual way to
> deal with this problem is to put a nonprinting character in front of
> the accented character that tells the system a) that the next
> character is an accented character and b) what a valid substitute
> would be. So, as I understand it, the database would store something
> like `Goödel' and searches would be able to match both `Godel' and
> `Gödel', but the webpages would always be able to display `Gödel'.
> (Forgive me if these characters don't come out right. I'm not
> accustomed to having to use them.)
>
> That said, is there anything of that sort (or something else,
> perhaps WAY better than I could possibly ever think of) in the works?
> Does anyone running the software have any kind of policy toward
> accented characters or special instructions for users utilizing these
> characters? Does anyone have any suggestions about how to better deal
> with this issue?
>
> Thanks! Let me know if I've not made myself clear.
>
> --pc
>
> PS - I also want to thank Chris for the braindump. The background
> info is interesting and I'm excited at some of the features being
> mulled over and worked on.
>
> --
> Clayton Carter crcarter AT cs.indiana.edu
> "My mom says I'm the handsomest guy in school."
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
[EP-tech] Accented Characters
From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Tue, 20 Mar 2001 14:39:58 -0500
Hi All,

I've a few questions and/or concerns related to diacritics. Primarily,
I'm concerned with how they're handled for the database, search and
display mechanisms.

Currently, it looks as though the characters put into the HTML form are
copied directly into the database and then displayed `as is' in the
HTML. I say `as is' because they aren't translated into any funny
looking HTML escape sequences like &Ouml;. Also, when searching for
words containing accents, one must specify the exact character sequence
(complete with accents) in order to score a hit.

I can certainly see how/why this is considered correct behavior, but I
was wondering if there was any more mature implementation planned? I've
been told that this is some sort of outstanding problem in library
databases, but that the usual way to deal with this problem is to put a
nonprinting character in front of the accented character that tells the
system a) that the next character is an accented character and b) what
a valid substitute would be. So, as I understand it, the database would
store something like `Goödel' and searches would be able to match both
`Godel' and `Gödel', but the webpages would always be able to display
`Gödel'. (Forgive me if these characters don't come out right. I'm not
accustomed to having to use them.)

That said, is there anything of that sort (or something else, perhaps
WAY better than I could possibly ever think of) in the works? Does
anyone running the software have any kind of policy toward accented
characters or special instructions for users utilizing these
characters? Does anyone have any suggestions about how to better deal
with this issue?

Thanks! Let me know if I've not made myself clear.

--pc

PS - I also want to thank Chris for the braindump. The background info
is interesting and I'm excited at some of the features being mulled
over and worked on.
--
Clayton Carter crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."
[EP-tech] ePrints and LDAP Authentication
From: William Nixon <W.Nixon AT lib.gla.ac.uk>
Date: Tue, 20 Mar 2001 16:17:52 -0000
Chris hi,

Thanks for the very useful overview of ePrints, including the back
history. Glad that you have garnered so much interest.

Here at the University of Glasgow we have installed ePrints v.1.0 and
are looking at the possibilities which it offers us.

One [currently missing] element which we are very interested in though
is the possibility of ePrints using the directory protocol LDAP to
authenticate users via our existing user databases on campus. Either
that or other alternatives which would allow them to use existing
usernames and passwords rather than creating new ones - and so adding
to the very long list of logins which they already use.

Is this something which other institutions would also be interested in?
How critical do they see the issue of authentication? How do they
currently deal with their ePrints authentication?

William J Nixon

==
William J Nixon, Deputy Head of IT Services / Project Co-ordinator
Glasgow University Library, Hillhead Street, Glasgow, G12 8QE, Scotland, UK
e-mail: w.j.nixon AT lib.gla.ac.uk www: http://www.lib.gla.ac.uk/staff/wnixon
tel: +44 (0)141 330 6721 fax: +44 (0)141 330 4952
[EP-tech] State of the system
From: ePrints Support <support AT eprints.org>
Date: Wed, 7 Mar 2001 18:46:56 +0000
Hi, and welcome to the list.
There are about 50 people signed up! I'm quite surprised (and pleased).
Here comes the brain dump. Feel free to comment and ask questions
but PLEASE DON'T QUOTE THIS WHOLE MESSAGE - it's really long - just
quote the bit you're interested in.
<BRAINDUMP BEGINS>
back history:
eprints.org 1.0 was written by Rob Tansley who has since left Southampton
University.
Since then I, Christopher Gutteridge, have been looking after the code.
I familiarised myself with the system and made a few minor changes to it,
this was eprints 1.1.
Now I'm working on adding my expertise to the system - some things were
OK but could be much better.
From the point of view of design and time, the biggest change is changing
the SQL back end. Previously the database stored each record as a single
row in a table in the database. Names were stored as a colon (:) separated
list in a VARCHAR(255) - if the names of all the authors of a record came
to more than 255 characters, problems would occur.
The solution to this is to allow ANY field to be multiple, in which
case it is stored in a separate table rather than the main table. For
example in the default configuration, subjects are now stored in a
separate table:
| Field | Type
+----------+--------------------
| eprintid | varchar(255)
| pos | int(10) unsigned
| subjects | varchar(255)
pos isn't relevant to all multiple types, but for authors the order
really matters.
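
The separate-table layout for multiple fields can be tried out with a small Python and SQLite sketch. The column names follow the table shown above; the table name `eprint_subjects`, the helper names and the use of SQLite rather than MySQL are assumptions made for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one multiple-value table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE eprint_subjects "
        "(eprintid TEXT, pos INTEGER, subjects TEXT)"
    )
    return conn


def set_subjects(conn, eprintid, values):
    """Replace all values for a record, one row per value; pos keeps
    the order in which the values were given."""
    conn.execute("DELETE FROM eprint_subjects WHERE eprintid = ?",
                 (eprintid,))
    conn.executemany(
        "INSERT INTO eprint_subjects VALUES (?, ?, ?)",
        [(eprintid, pos, value) for pos, value in enumerate(values)],
    )


def get_subjects(conn, eprintid):
    """Return the values for a record in their original order."""
    rows = conn.execute(
        "SELECT subjects FROM eprint_subjects "
        "WHERE eprintid = ? ORDER BY pos",
        (eprintid,),
    )
    return [row[0] for row in rows]
```

Each value gets its own row, so the total length of all values for a record is no longer bounded by a single column's width.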
Names are now stored as two columns: family & given - these are the
most locale-neutral descriptions I could find - they could still change.
The "indexed" option is no longer an option. Everything will be indexed.
This will speed up searches and slow down updates - which I think is fine.
URL, EMAIL, TEXT and MULTITEXT fields will have all the identifiable words
indexed in a separate table, so that searching for records which contain
"foo" just looks it up in this table. MySQL's own freetext searching is
not yet mature enough - maybe we will be able to use it later.
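
A word index of the kind described here is essentially an inverted index from word to record ids. The following Python sketch shows the idea with an in-memory dict standing in for the separate table; the minimum word size, the list of "never" words and all names are hypothetical, and the word-splitting rule is much simpler than a real one would be.

```python
import re
from collections import defaultdict

MIN_WORD_SIZE = 3              # hypothetical site setting
NEVER_WORDS = {"the", "and"}   # hypothetical "too common" list


def split_words(text):
    """Return (good, bad): words to index and words to ignore."""
    good, bad = set(), set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()
        if len(word) >= MIN_WORD_SIZE and word not in NEVER_WORDS:
            good.add(word)
        else:
            bad.add(word)
    return good, bad


class WordIndex:
    """Maps each indexed word to the set of record ids containing it."""

    def __init__(self):
        self._ids = defaultdict(set)

    def add(self, record_id, text):
        good, _ignored = split_words(text)
        for word in good:
            self._ids[word].add(record_id)

    def lookup(self, word):
        return set(self._ids.get(word.lower(), ()))
```

A search for records containing "foo" then becomes a single lookup rather than a scan of every text field.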
Identifying the words will be done by a function in SiteRoutines so that
you can tweak it for your own needs. This function also returns a list
of words which were ignored (too short or too common eg. "a", "the").
ALL access to the database is now done via the search expression module
(except complete dumps). There is now no way to retrieve certain columns
from the database - you just get the whole lot.
I'm considering a "map table" method which will create an object for
every record in a table (archive, users, subscriptions) and then
apply a passed in subroutine to it - thus meaning scripts which process
the whole data don't have to dump it all into memory at once.
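
The "map table" idea, applying a passed-in routine to each record in turn instead of loading the whole table, could look like this in Python with SQLite. The table names come from the message; the function name, the whitelist check and the use of a dict per row are assumptions for the sketch.

```python
import sqlite3

# Tables named in the message; used here to whitelist the table name,
# since it is interpolated into the SQL text.
KNOWN_TABLES = {"archive", "users", "subscriptions"}


def map_table(conn, table, fn):
    """Call fn(record) for every row of `table`, one row at a time.

    Iterating over the cursor fetches rows as they are needed, so the
    whole table is never held in memory at once. Returns the number of
    records processed.
    """
    if table not in KNOWN_TABLES:
        raise ValueError("unknown table: %r" % table)
    cursor = conn.execute("SELECT * FROM %s" % table)
    columns = [description[0] for description in cursor.description]
    count = 0
    for row in cursor:
        fn(dict(zip(columns, row)))
        count += 1
    return count
```

A script that mails every user, for instance, would pass a function that handles one user record and never needs to see the full list.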
Searching got a whole lot more complex with N tables to search for a
given request - I tried doing the whole lot in one go by generating SQL
requests as long as they needed to be - the MySQL optimiser didn't do a
very good job, so now it performs the search part by part using temporary
tables - which is a performance hit on small databases, but an improvement
on large ones. The system now also uses EXPLAIN, which is virtually a free
function (in terms of speed), to decide which order to do the search in.
For large result sets just reading all the data from disk takes some time
(I'm testing on my desktop machine, which is also running X, netscape and an mp3
player, so it won't be my final benchmarking system). There is a point
at which I have a list of all the ids of records to retrieve but haven't got
them yet - at this point if this list is longer than an admin-configurable
limit (say 1000) it will just return the first 1000 off the pile, unsorted. It
will clearly warn you that it has done this. I will probably make this
optional - but if someone searches for, say, all records before 2020 it would
dump the entire DB - possibly very big, and time consuming.
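The cut-off could look something like this sketch (Python, illustrative only; cap_results and the default limit of 1000 are assumptions):

```python
def cap_results(ids, limit=1000):
    """Return (ids_to_fetch, truncated); truncated is True if the list was cut."""
    if len(ids) > limit:
        return ids[:limit], True  # first `limit` ids off the pile, unsorted
    return ids, False

ids_to_fetch, truncated = cap_results(list(range(50000)))
```

The truncated flag is what would drive the warning shown to the user.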
For reference my tests so far have been done on an eprints database of
50000 records.
I'm planning to combine/remove and create some datatypes.
roughly
"subjects" renames to "subject" (may be multiple of course)
"multiurl" goes. Use url, multiple=yes instead.
"multitext" gets a better name, not sure what yet.
"enum" goes. It is just a non multiple set and will be treated as such.
New fields: (Names are working titles only)
"textwithid" combination of a text field and an "ID" field which is
effectively just another text field, but strictly associated
with this text field - eg. You may want a textwithid field
called book where "text" is the bookname and "id" is the ISBN.
"namewithid" same concept but for people. I want to be able to uniquely
identify people in the system - I'm not recommending an ID system, but using
the eprints username will work OK - the "editor" can
create non-login "user" entries for people not already in the
system.
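As a sketch of the shape of these two working-title types (Python dataclasses for illustration; the class names and the placeholder book title and ISBN are made up):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TextWithId:
    text: str                 # eg. the book name
    id: Optional[str] = None  # eg. the ISBN; optional, as it may not be known

@dataclass
class NameWithId:
    family: str
    given: str
    id: Optional[str] = None  # eg. the eprints username

book = TextWithId(text="An Example Book", id="0-000-00000-0")  # placeholder ISBN
author = NameWithId(family="Gutteridge", given="Christopher", id="cjg")
```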
I'm also considering making a system to create user views like subject view
so people can link to their publications. I would also like to generate
non-wrapped HTML versions of the views so these can be harvested into info
pages on users. See http://www.ecs.soton.ac.uk/info/people/swh - the list
of publications is imported from our current database (not eprints, but
we will move over as part of the testing for this new eprints version).
The goal of the id fields is to be ready for the future when things
will be more cross referenced. In a world of 6 billion+ people, just
a name doesn't identify you for sure anymore. This isn't a solved problem
but we want to get closer.
For people, you may want to identify them on a record by name and
local eprints username, and then store one, or more!, ways of identifying
them as part of the user record.
Another thing: a way to import an old archive as XML:
<RECORD>
<TEXT field="eprintid">demo1</TEXT>
<TEXT field="username">cjg</TEXT>
<TEXT field="title">Title of paper (#1)</TEXT>
<EPRINTTYPE field="type">confpaper</EPRINTTYPE>
<YEAR field="year">1952</YEAR>
<MULTITEXT field="abstract">da abstract</MULTITEXT>
<TEXT field="conference">conference goes here!</TEXT>
<TEXT field="succeeds"></TEXT>
<TEXT field="commentary"></TEXT>
<NAME field="authors">
<FAMILY>Polden</FAMILY>
<GIVEN>Neil</GIVEN>
</NAME>
<NAME field="editors">
<FAMILY>Schilhabel</FAMILY>
<GIVEN>Jude</GIVEN>
</NAME>
<NAME field="editors">
<FAMILY>Rosinger</FAMILY>
<GIVEN>James</GIVEN>
</NAME>
<SUBJECTS field="subjects">arts-flms</SUBJECTS>
<SUBJECTS field="subjects">arts-fnar</SUBJECTS>
</RECORD>
This is an example of XML which is recognised by my development version. The
idea is to make it easy to (a) transfer from another system to using eprints,
and (b) to make it easy to populate an eprints system with a whole load of
data for testing purposes.
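To show how little is needed to read this format, here is a sketch of a reader in Python using the standard xml.etree.ElementTree module. It is not the import script itself; parse_record and the cut-down sample are illustrative, and every field is treated as a list so that repeated fields (editors, subjects) need no special case.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the record above.
SAMPLE = """<RECORD>
<TEXT field="eprintid">demo1</TEXT>
<YEAR field="year">1952</YEAR>
<NAME field="editors"><FAMILY>Schilhabel</FAMILY><GIVEN>Jude</GIVEN></NAME>
<NAME field="editors"><FAMILY>Rosinger</FAMILY><GIVEN>James</GIVEN></NAME>
<SUBJECTS field="subjects">arts-flms</SUBJECTS>
</RECORD>"""

def parse_record(xml_text):
    """Turn one <RECORD> into a dict mapping field name -> list of values."""
    record = {}
    for element in ET.fromstring(xml_text):
        if element.tag == "NAME":
            value = {"family": element.findtext("FAMILY", ""),
                     "given": element.findtext("GIVEN", "")}
        else:
            value = (element.text or "").strip()
        record.setdefault(element.get("field"), []).append(value)
    return record

record = parse_record(SAMPLE)
```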
There will be another patch version of 1.1. There are still a few minor tweaks
which could be made: it's OAI 1.0 compliant, but not *robustly* compliant
(don't worry - it just doesn't 404 when it should, which doesn't matter in day
to day use). The next patch will include a script which will
dump a 1.1 eprints archive into the above XML format (or the finalised
version) so that when/if you upgrade to 1.2 (or 2.0?) you can just import
the old data. This is the best way I could think of for moving between the
two structures, although I don't want people using this for anything else,
as non-standard export methods only cause trouble long term.
--
Thinking about mirroring methods - currently I'm thinking about
using MySQL's own mirroring system & rdist...
--
"Subscriptions" for staff to the Submissions Buffer so they get updated
when certain things come in.
---
Internationalisation:
This is a biggy and a lot of people care about it.
I've made a first stab at this but have delayed it until the bulk of the
code re-design is done, or people will have to translate things which
I then go and change or delete altogether.
Currently the plan is to make a config file for each language to translate
the code. A cookie can control which language is shown.
This same cookie can be used with mod_rewrite to display a different static
directory depending on the language.
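On the code side, picking the language from the cookie might look like this sketch (Python, using the standard http.cookies module; the cookie name "lang", the supported set and the default are all assumptions):

```python
from http.cookies import SimpleCookie

SUPPORTED = {"en", "fr", "de"}  # hypothetical: languages with a config file
DEFAULT = "en"                  # hypothetical: the archive's own language

def language_from_cookie(cookie_header, cookie_name="lang"):
    """Return the language named in the Cookie header, or DEFAULT if unusable."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(cookie_name)
    if morsel is not None and morsel.value in SUPPORTED:
        return morsel.value
    return DEFAULT

chosen = language_from_cookie("lang=fr; session=abc")
```

Falling back to the default for a missing or unknown value means a stale cookie can never select a language that has no config file.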
The "help" in the config file for each field can be set for each language -
and the SiteInfo file will need to have internationalisation changes - I've
not worried too much about this yet.
The Subjects list presents some problems. I'm currently planning NOT to
provide a way of making this available in any language other than that
of the archive - which is the language most of the data will be in anyway.
---
Look and feel
I plan to "tart up" the default look to an eprints archive and use
more stylesheets and less <CENTER> tags. Other than that I've not been
thinking about this too much as most people will largely change it.
---
I'd like a way to make some formats "Private" eg. You make the PDF and
Postscript public but only certain people can download the original LaTeX.
I've got a few ideas - one is to put in a .htaccess file to control apache,
the other is to PGP encrypt the file. This is still very up in the air.
---
And that's more or less everything.
I've been, am, and expect to be, pretty busy.
I look forward to people's comments...
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833
[EP-tech] First Message
From: ePrints Support <support AT eprints.org>
Date: Tue, 6 Mar 2001 17:46:38 +0000
Welcome to the eprints.org technical list.
--
Christopher Gutteridge support AT eprints.org
ePrints Technical Support +44 23 8059 4833