Tech List

eprints_tech messages

Please note: this page shows emails that have been sent to the eprints_tech mailing list. Some of these may be spam emails we have failed to filter.

[EP-tech] Other little things...

From: ePrints Support <support AT eprints.org>
Date: Mon, 9 Apr 2001 10:45:29 +0100




I've decided to explicitly support ISO-LATIN-1. I know that OAI uses
UTF-8, but if I don't limit my scope of features I'll never finish.

I'm thinking of redesigning the web-based registration. Well in fact, I'm 
going to, but I'm thinking about how.

I'm thinking of having a page which asks you to enter your email. The system
then generates a random password for that email, sets the username to
be the SAME as the email and mails that person their password. This will
close the can of worms which is the process_mail script, which I think has
been the single biggest support problem.
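
A minimal sketch of the signup step described above, assuming hypothetical helper names (random_password, new_signup_account) that are not part of eprints. It only builds the account details; mailing the password and storing the record are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: build a random password (8 characters by default).
# Characters that are easy to confuse (0/O, 1/l/I) are left out.
# NOTE: rand() is not cryptographically strong; a real system would
# want a better source of randomness.
sub random_password
{
	my( $length ) = @_;
	$length = 8 unless defined $length;
	my @chars = ( 'a'..'k', 'm'..'z', 'A'..'H', 'J'..'N', 'P'..'Z', '2'..'9' );
	return join( "", map { $chars[ int rand @chars ] } 1..$length );
}

# Hypothetical helper: create the account details for a web signup.
# Returns undef if the input does not look like an email address.
sub new_signup_account
{
	my( $email ) = @_;
	# Very loose sanity check, not a full address validator
	return unless $email =~ m/^[^\s\@]+\@[^\s\@]+$/;
	return {
		username => $email,   # username is the SAME as the email
		email    => $email,
		password => random_password(),
	};
}
```

Because every signup username contains an "@", it cannot collide with a locally allocated username, as long as local usernames never contain one.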

This also adds a neat feature: if you have "local" and "signup"
users, eg. members of your dept. and just random people off the web, then
you know that the "signup" users will have an "@" in their username, which
means you can never have a problem I've been worried about, which is:

Our department allocates usernames to its users, and we want them to be able
to use the same username/password as for email and logins.

Problem: someone signs up on the web for a username which is free, but
then the dept. admin allocates that username internally 3 months later. Not
being able to give a member of our dept. a username because someone on the
eprints system is already using it is a case of the tail wagging the dog.

Hence the email address scheme above is a nice solution. This doesn't mean
local users couldn't all log in via email address too - email addresses have
the nice feature of being unique (well, most of the time).


I'm toying with removing process_mail from the next minor release of eprints 1.1
as it's causing a lot of pain - I would make the "username" requested be
the email address minus everything after the "@".

Would people think that this is great or the work of the devil?

For a demonstration of how this system would work, the following web "service"
has been running for some time without serious problems. Feel free to "poke it"
if you want to see how the web based signup would work (ish):
http://totl.net/VCash/




-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

[EP-tech] Slowly but surely

From: ePrints Support <support AT eprints.org>
Date: Mon, 9 Apr 2001 10:32:03 +0100




Latest things I've been working on:

XHTML, stylesheets & DOM.

I've been changing the internals of eprints so that rather than generating
a webpage by doing:

print "<P>";
print "My Cat";
print "</P>";

The system now builds the entire page as an xml tree then prints it:

use XML::DOM;
# Build the page as a DOM tree, then serialise it in one go
my $page = XML::DOM::Document->new;
my $p = $page->createElement( "p" );
my $cat = $page->createTextNode( "My Cat" );
$p->appendChild( $cat );
$page->appendChild( $p );
print $page->toString;

OK, that doesn't produce a legal page (no <html> etc) but you get the idea.

XHTML is, in a nutshell, HTML represented in XML (which can still be parsed
by current browsers). The practical differences are that all element and 
attribute names must be lower case, and that ALL elements must be closed. Eg.
<br></br>
which can be abbreviated to:
<br />

The template for pages in the config module must be in XHTML, as the system
parses it into a tree when it starts up (not on every page request).

This is slightly harder, but means that the system will always produce 
well formed pages. Which is nice.

I've been removing as much markup as I can from the HTML (XHTML now!) produced
by eprints, and replacing it with class="foo" attributes on each part, so that
the admin can control the look of the generated pages without having to hack
at the code.

I've changed the way "citation configuration" works - this is the way in which
a record is rendered into a single string, for search result pages etc.

The new system is, shock!, XML - or more accurately an XHTML fragment with
two extra elements: <IF> and <FIELD> (note these *are* uppercase so they
are clearly distinct from the XHTML elements).

Eg. for conf paper:

<FIELD name="authors"/> <IF name="year">(<FIELD name="year"/>) </IF>
<FIELD name="title"/>. In <IF name="editors"><FIELD name="editors"/>, Eds. </IF>
<IF name="conference"><i>Proceedings <FIELD name="conference" /></i></IF>
<IF name="volume"> <b><FIELD name="volume" /></b></IF>
<IF name="number"> (<FIELD name="number" />)</IF>
<IF name="pages">, pages <FIELD name="pages" /></IF>
<IF name="confloc">, <FIELD name="confloc" /></IF>.

I know this is more verbose but it allows control of exactly what you want
rather than the old system which was yet another config system.

When rendering, the system first removes all <IF> elements, leaving their
contents behind only IF the named field is not empty in this record. Then the
system replaces <FIELD> tags with the value of the named field.
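
The two rendering steps can be sketched roughly as below. This is an illustration only: render_citation is a hypothetical name, it uses regular expressions rather than a DOM tree, it assumes <IF> elements are never nested, and it does not escape field values for XML.

```perl
use strict;
use warnings;

# Hypothetical helper: render a citation template against a record
# (a hash ref of field name => value).
sub render_citation
{
	my( $template, $record ) = @_;

	# Step 1: resolve each <IF name="...">...</IF>. Keep the contents if
	# the named field is non-empty, otherwise drop the whole element.
	$template =~ s{<IF\s+name="([^"]+)"\s*>(.*?)</IF>}
	              { ( defined $record->{$1} && $record->{$1} ne "" ) ? $2 : "" }gse;

	# Step 2: replace each <FIELD name="..."/> with the field's value.
	$template =~ s{<FIELD\s+name="([^"]+)"\s*/>}
	              { defined $record->{$1} ? $record->{$1} : "" }ge;

	return $template;
}
```

For a record with no "year" field, the whole "(year) " part simply disappears from the output, brackets included.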

I'm considering doing something similar with the language configuration files,
using XML.

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

Re: [EP-tech] Problems with OO Design

From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Mon, 26 Mar 2001 15:10:59 -0500




	This is good to hear.  The more I learn about the next
release, the more excited I get about it.  Who ever said serious
software can't be cool?  :)

--pc

On Fri, Mar 23, 2001 at 12:41:58PM +0000, ePrints Support wrote:
> Although under normal circumstances, I'd take a patch and say thankyou,
> the current development version of EPrints looks VERY different.
> 
> I've been slowly phasing out variables like the one you mention below, 
> I'll make even more of an effort now.
> 
> I would like to make a read-only version of the CVS available at some
> point but at the moment there's a lot of 'scaffold' to make things work
> as I rewrite various sections. I'd spend all my time helping people
> get it working.
> 
> One neat little change is that it will load all the modules when apache
> starts, NOT each time it spawns a sub process.
> 
> Also SiteInfo, SiteRoutine etc are going to be rolled into one module,
> which will be named after the id of your site eg.
> EPrintSite/cogprints.pm
> 
> This module will provide an object which will represent the site configuration
> including references to methods for validation, rendering etc.
> 
> It decides what config module to load based on the host and path of the URL
> request and does some reflection, which perl is good at.
> 
> 
> On Thu, Mar 22, 2001 at 03:26:10PM -0500, Clayton Carter wrote:
> 

-- 
Clayton Carter   crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."

Re: [EP-tech] Problems with OO Design

From: ePrints Support <support AT eprints.org>
Date: Fri, 23 Mar 2001 16:55:03 +0000




Enough already :-)

I've set up a nightly script which knocks together a tar file of 
the code. This is available from http://www.ecs.soton.ac.uk/~cjg/eprints/

For those of you not familiar with development code, don't be surprised
if it (A) doesn't work and (B) isn't quite as well commented as the release
version...



On Fri, Mar 23, 2001 at 02:39:10PM -0000, Tim Brody wrote:
> > I would like to make a read-only version of the CVS available at some
> > point but at the moment there's a lot of 'scaffold' to make things work
> > as I rewrite various sections. I'd spend all my time helping people
> > get it working.
> 
> Awww ... you just want to keep it to yourself don't you?
> 
> Seriously, does it really matter? After all, compiling from CVS is only for
> developers, if you want a working system you should download a release
> version...
> 
> (One of the reasons I've not got around to looking at ePrint internals is
> because it would involve downloading/installing, source code access via CVS
> obviates that need)
> 
> Regards,
> Tim.

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

Re: [EP-tech] Problems with OO Design

From: "Tim Brody" <tdb198 AT soton.ac.uk>
Date: Fri, 23 Mar 2001 14:39:10 -0000




> I would like to make a read-only version of the CVS available at some
> point but at the moment there's a lot of 'scaffold' to make things work
> as I rewrite various sections. I'd spend all my time helping people
> get it working.

Awww ... you just want to keep it to yourself don't you?

Seriously, does it really matter? After all, compiling from CVS is only for
developers, if you want a working system you should download a release
version...

(One of the reasons I've not got around to looking at ePrint internals is
because it would involve downloading/installing, source code access via CVS
obviates that need)

Regards,
Tim.

Re: [EP-tech] Problems with OO Design

From: ePrints Support <support AT eprints.org>
Date: Fri, 23 Mar 2001 12:41:58 +0000




Although under normal circumstances I'd take a patch and say thank you,
the current development version of EPrints looks VERY different.

I've been slowly phasing out variables like the one you mention below;
I'll make even more of an effort now.

I would like to make a read-only version of the CVS available at some
point but at the moment there's a lot of 'scaffold' to make things work
as I rewrite various sections. I'd spend all my time helping people
get it working.

One neat little change is that it will load all the modules when Apache
starts, NOT each time it spawns a subprocess.

Also SiteInfo, SiteRoutine etc. are going to be rolled into one module,
which will be named after the id of your site, eg.
EPrintSite/cogprints.pm

This module will provide an object which will represent the site configuration
including references to methods for validation, rendering etc.

It decides what config module to load based on the host and path of the URL
request and does some reflection, which Perl is good at.
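
As a rough illustration of choosing and loading a per-site config module, here is a sketch under stated assumptions: the lookup table, the example.org hostnames and both function names are hypothetical, and the real development code may do this quite differently.

```perl
use strict;
use warnings;

# Hypothetical table: URL prefix (host + path) => site id.
my %SITE_ID_FOR_PREFIX = (
	"cogprints.example.org/"   => "cogprints",
	"www.example.org/archive/" => "demo",
);

# Work out which site a request belongs to; the longest prefix wins.
sub site_id_for_request
{
	my( $host, $path ) = @_;
	my $url = lc( $host ) . $path;
	foreach my $prefix ( sort { length( $b ) <=> length( $a ) } keys %SITE_ID_FOR_PREFIX )
	{
		return $SITE_ID_FOR_PREFIX{$prefix} if index( $url, $prefix ) == 0;
	}
	return;
}

# Load EPrintSite/<id>.pm at run time and return the package name.
sub load_site_module
{
	my( $site_id ) = @_;
	# Refuse anything that is not a plain word, so a request cannot
	# make us load an arbitrary file.
	die "bad site id: $site_id\n" unless $site_id =~ m/^\w+$/;
	require "EPrintSite/$site_id.pm";
	return "EPrintSite::$site_id";
}
```

Once the package name is known, ordinary Perl reflection such as $module->can( "some_method" ) can be used to find out which optional routines a site provides.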


On Thu, Mar 22, 2001 at 03:26:10PM -0500, Clayton Carter wrote:
> 	Thanks for the note about the indexed searching.  That's great
> news and I'm glad to hear it.
> 
> 	I've now a request/complaint/rambling string of sentences.  We
> (at the Indiana University Digital Library Program) are really pleased
> with EPrints but, as usual, there are things we've needed to change or
> tweak.  I've accomplished this in a very OO way (and a rather -- to my
> senses, at least -- elegant way) as such:
> 
> 	EPrints::SubmissionForm::update_from_subject_form() needed to
> be changed.  (Well, maybe it didn't *need* to be changed, but this is
> the solution I came up with.)
> 
> 	In order to do this, I moved SubmissionForm.pm to
> SubmissionFormTheirs.pm (perhaps `EP' or `Orig' would have been a
> better name) and did a global search replace in that file to change
> all occurrences of `EPrints::SubmissionForm' to
> `EPrints::SubmissionFormTheirs'.  I then made a symlink from 
> `perl_lib/EPrints/SubmissionForm.pm' to 
> `perl_lib/DLIB/SubmissionForm.pm' to help us keep all of our changes
> in one place.  `DLIB/SubmissionForm.pm' looked something like this:
> 
> package EPrints::SubmissionForm;
> 
>   ...
> 
> use base qw(EPrints::SubmissionFormTheirs);  # combined use and @ISA
> 
>   ...
> 
> sub update_from_subject_form
> { 
>   ... 
> }
> 
> 1;
> 
> 	This worked great, except for one problem.  A few of the
> scripts, like `cgi/staff/view_submission', referred directly to class
> variables stored in the old `EPrints::SubmissionForm' (now
> `EPrints::SubmissionFormTheirs').  The script broke because of the
> fact that my inheriting package didn't have the same globals defined.
> (In particular, the access to the `action_*' variables is what caused
> the problems.)  The short solution was to just copy all of the
> `action_*' and `stage_*' variables into my package.
> 
> 	Keep in mind that, while OO is -- for the most part -- old hat
> to me, I'm still getting used to Perl's OO stuff.  Basically, I'm
> wondering if it's at all on the plate to modify things like this for a
> more OO friendly interface.  And I mean this in the simplest of ways:
> providing a method which takes an action_type and returns the
> associated text.
> 
> 	I guess this isn't a big deal, but it would make the system
> that much easier to extend.  I don't think that this particular
> example would be that hard to implement, so if I put together a patch
> to make this change, what format would be preferred?  (I'm none too
> well acquainted with unified diffs.)
> 
> --pc
> 
> -- 
> Clayton Carter   crcarter AT cs.indiana.edu
> "My mom says I'm the handsomest guy in school."

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

[EP-tech] Problems with OO Design

From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Thu, 22 Mar 2001 15:26:10 -0500




	Thanks for the note about the indexed searching.  That's great
news and I'm glad to hear it.

	I've now a request/complaint/rambling string of sentences.  We
(at the Indiana University Digital Library Program) are really pleased
with EPrints but, as usual, there are things we've needed to change or
tweak.  I've accomplished this in a very OO way (and a rather -- to my
senses, at least -- elegant way) as such:

	EPrints::SubmissionForm::update_from_subject_form() needed to
be changed.  (Well, maybe it didn't *need* to be changed, but this is
the solution I came up with.)

	In order to do this, I moved SubmissionForm.pm to
SubmissionFormTheirs.pm (perhaps `EP' or `Orig' would have been a
better name) and did a global search replace in that file to change
all occurrences of `EPrints::SubmissionForm' to
`EPrints::SubmissionFormTheirs'.  I then made a symlink from 
`perl_lib/EPrints/SubmissionForm.pm' to 
`perl_lib/DLIB/SubmissionForm.pm' to help us keep all of our changes
in one place.  `DLIB/SubmissionForm.pm' looked something like this:

package EPrints::SubmissionForm;

  ...

use base qw(EPrints::SubmissionFormTheirs);  # combined use and @ISA

  ...

sub update_from_subject_form
{ 
  ... 
}

1;

	This worked great, except for one problem.  A few of the
scripts, like `cgi/staff/view_submission', referred directly to class
variables stored in the old `EPrints::SubmissionForm' (now
`EPrints::SubmissionFormTheirs').  The script broke because
my inheriting package didn't have the same globals defined.
(In particular, the access to the `action_*' variables is what caused
the problems.)  The short solution was to just copy all of the
`action_*' and `stage_*' variables into my package.

	Keep in mind that, while OO is -- for the most part -- old hat
to me, I'm still getting used to Perl's OO stuff.  Basically, I'm
wondering if it's at all on the plate to modify things like this for a
more OO friendly interface.  And I mean this in the simplest of ways:
providing a method which takes an action_type and returns the
associated text.
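
	The kind of accessor being asked for might look like the sketch below.  The %ACTION_TEXT table, its keys and its labels are hypothetical stand-ins for the real `action_*' variables, and the two packages are put in one file only to keep the example self-contained.

```perl
use strict;
use warnings;

package EPrints::SubmissionFormTheirs;

# Hypothetical stand-ins for the action_* class variables.
my %ACTION_TEXT = (
	next   => "Next >",
	prev   => "< Back",
	cancel => "Cancel",
);

sub new
{
	my( $class ) = @_;
	return bless {}, $class;
}

# Accessor: takes an action type, returns the associated text.
sub action_text
{
	my( $self, $action_type ) = @_;
	return $ACTION_TEXT{$action_type};
}

package EPrints::SubmissionForm;

# Same effect as "use base" when both packages live in one file.
our @ISA = ( "EPrints::SubmissionFormTheirs" );

# A subclass can override one label and inherit the rest.
sub action_text
{
	my( $self, $action_type ) = @_;
	return "Continue" if $action_type eq "next";
	return $self->SUPER::action_text( $action_type );
}

package main;

my $form = EPrints::SubmissionForm->new;
print $form->action_text( "next" ), "\n";   # overridden in the subclass
print $form->action_text( "prev" ), "\n";   # inherited from the parent
```

	Scripts that call the method instead of reading package variables keep working whichever package the object was blessed into, so nothing has to be copied into the inheriting package.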

	I guess this isn't a big deal, but it would make the system
that much easier to extend.  I don't think that this particular
example would be that hard to implement, so if I put together a patch
to make this change, what format would be preferred?  (I'm none too
well acquainted with unified diffs.)

--pc

-- 
Clayton Carter   crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."

Re: [EP-tech] ePrints and LDAP Authentication

From: ePrints Support <support AT eprints.org>
Date: Thu, 22 Mar 2001 11:18:51 +0000





I've been working on the ground work for this, this morning.

We have a similar situation - I want local users to use their encrypted
password and UNIX username, but I still want external users to be able
to sign up for subscriptions.

I'm currently thinking that in this situation, non-"local" users will
be forced to have a prefix to their userid (eg. "@") to make sure they
don't cross into the same range as the local users.

I'm thinking of being able to have N different kinds of users, rather
than the current "Staff" and "User". Each user type will have permission
to use various features:
	basic user stuff (password change, set info)
	subscriptions
	deposit papers
	staff searches & siteinfo
	editorial (approving/rejecting items)
	admin - modifying other users etc.

and each type of user will have their own authentication method, provided
as a callback to a mod_perl authentication handler.

For my example, I would have "Staff", "Local" and "Guest" types with different
privs. in the system. All 3 would use the Apache::AuthDBI::authen
method, but "Guest" would set the encrypted password to "off" and "Staff" and
"Local" would set encrypted to "on".

This section would also configure whether a usertype was allowed to set their 
password or not - it doesn't make sense for users who are having their 
eprints account initialised from an external source to be able to change their
password.
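
To make the idea concrete, a user type table along these lines is one possibility. Everything here is an assumption for illustration: the privilege names, which type gets which privilege, and the user_type_can helper are invented, and the authentication callback itself is not shown.

```perl
use strict;
use warnings;

# Hypothetical privilege names and per-type settings.
my %USER_TYPES = (
	Staff => {
		privs            => [qw( user subscription deposit staff_search editor admin )],
		encrypted        => 1,   # password is stored encrypted
		can_set_password => 0,   # account comes from an external source
	},
	Local => {
		privs            => [qw( user subscription deposit )],
		encrypted        => 1,
		can_set_password => 0,
	},
	Guest => {
		privs            => [qw( user subscription )],
		encrypted        => 0,
		can_set_password => 1,   # signed up on the web, owns the password
	},
);

# Does the given user type hold the given privilege?
sub user_type_can
{
	my( $type, $priv ) = @_;
	my $info = $USER_TYPES{$type} or return 0;
	return scalar( grep { $_ eq $priv } @{ $info->{privs} } );
}
```

Adding a new kind of user then means adding one entry to the table rather than touching the code that checks permissions.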

How does this apply to LDAP? If all goes to plan, your local users can use
the Perl LDAP authentication module instead, and totally ignore the password
field in eprints, although you will still need to dump info into the database
from the LDAP system for things like their name, email etc.

This is all in development and suggestions, feedback, abuse etc are
welcome.


	

On Tue, Mar 20, 2001 at 04:17:52PM -0000, William Nixon wrote:
> Chris hi,
> 
> Thanks for the very useful overview of ePrints, including the back history.
> Glad that you have garnered so much interest.
> 
> Here at the University of Glasgow we have installed ePrints v.1.0 and are
> looking at the possibilities which it offers us. 
> 
> One [currently missing] element which we are very interested in though is
> the possibility of ePrints using the directory protocol LDAP to authenticate
> users via our existing user databases on campus. Either that or other
> alternatives which would allow them to use existing usernames and passwords
> rather than creating new ones - and so adding to the very long list of
> logins which they already use.
> 
> Is this something which other institutions would also be interested in? How
> critical do they see the issue of authentication? How do they currently deal
> with their ePrints authentication?
> 
> William J Nixon
> 
> ==
> William J Nixon, Deputy Head of IT Services / Project Co-ordinator
> Glasgow University Library, Hillhead Street, Glasgow, G12 8QE, Scotland, UK
> e-mail:	w.j.nixon AT lib.gla.ac.uk	www:
> http://www.lib.gla.ac.uk/staff/wnixon
> tel:	+44 (0)141 330 6721	fax:	+44 (0)141 330 4952

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

Re: [EP-tech] Accented Characters

From: ePrints Support <support AT eprints.org>
Date: Tue, 20 Mar 2001 21:07:26 +0000




aha. The new version I'm working on keeps an index of words which 
are in each record.

To improve matching it removes 's' from the end of an index word
and some other bits and bobs. When you search for some words it
then uses the same metrics. Unless you ask for an exact search, a
search for "beasts" will look up "beast" in the word index.

How is this related? When I put things into the word index I'll
strip the accents, so that a record containing 'Gödel'
will appear in searches for 'Gödel' and for 'Godel' (and 'Godels' 
for that matter).

In fact the routine which decides which words should be indexed,
and as what, is in the SiteRoutines so that you can tweak it to
the needs of your archive.

Here's the current version, it'll probably change a bit before I'm
done. For example I plan to remove all accents before indexing the
words. 

Sound good?

######################################################################
#
# extract_words( $text )
#
#  This method is used when indexing a record, to decide what words
#  should be used as index words.
#  It is also used to decide which words to use when performing a
#  search. 
#
#  It returns references to 2 arrays, one of "good" words which should
#  be used, and one of "bad" words which should not.
#
######################################################################

sub extract_words
{
	my( $text ) = @_;
	
	# Remove single quotes so "don't" becomes "dont"
	$text =~ s/'//g;

	# Normalise acronyms eg.
	# The F.B.I. is like M.I.5.
	# becomes
	# The FBI  is like MI5
	my $a;
	$text =~ s#[A-Z0-9]\.([A-Z0-9]\.)+#$a=$&;$a=~s/\.//g;$a#ge;

	# Remove hyphens from acronyms
	$text=~ s#[A-Z]-[A-Z](-[A-Z])*#$a=$&;$a=~s/-//g;$a#ge;

	# Replace any non alphanumeric characters with a space instead
	$text =~ s/[^a-zA-Z0-9]/ /g;

	# Iterate over every word (space separated values) 
	my @words = split  /\s+/ , $text;
	# We use hashes rather than arrays at this point to make
	# sure we only get each word once, not once for each occurrence.
	my %good = ();
	my %bad = ();
	foreach( @words )
	{	
		# skip if this is nothing but whitespace;
		next if /^\s*$/;

		# calculate the length of this word
		my $wordlen = length $_;

		# $ok indicates if we should index this word or not

		# First approximation is if this word is over or equal
		# to the minimum size set in SiteInfo.
		my $ok = $wordlen >= $EPrintSite::SiteInfo::freetext_min_word_size;
	
		# If this word is at least 2 chars long and all capitals
		# it is assumed to be an acronym and thus should be indexed.
		if( m/^[A-Z][A-Z0-9]+$/ )
		{
			$ok=1;
		}

		# Consult list of "never words". Words which should never
		# be indexed.	
		if( $EPrintSite::SiteInfo::freetext_never_words{lc $_} )
		{
			$ok = 0;
		}
		# Consult list of "always words". Words which should always
		# be indexed.	
		if( $EPrintSite::SiteInfo::freetext_always_words{lc $_} )
		{
			$ok = 1;
		}
	
		# Add this word to the good list or the bad list
		# as appropriate.	
		if( $ok )
		{
			# Only "bad" words are used in display to the
			# user. Good words can be normalised even further.

			# non-acronyms (ie not all UPPERCASE words) have
			# a trailing 's' removed. Thus in searches the
			# word "chair" will match "chairs" and vice-versa.
			# This isn't perfect "mose" will match "moses" and
			# "nappy" still won't match "nappies" but it's a
			# reasonable attempt.
			s/s$//;

			# If any of the characters are lowercase then lower
			# case the entire word so "Mesh" becomes "mesh" but
			# "HTTP" remains "HTTP".
			if( m/[a-z]/ )
			{
				$_ = lc $_;
			}
	
			$good{$_}++;
		}
		else 
		{
			$bad{$_}++;
		}
	}
	# convert hash keys to arrays and return references
	# to these arrays.
	my( @g ) = keys %good;
	my( @b ) = keys %bad;
	return( \@g , \@b );
}
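
The listing above does not yet do the accent removal mentioned before it. A minimal sketch of what that step could look like for ISO-LATIN-1 text follows; strip_latin1_accents is a hypothetical name, and the mapping covers only the common accented letters (ligatures such as the sharp s or ae are left alone).

```perl
use strict;
use warnings;

# Hypothetical helper: map accented ISO-8859-1 letters to plain ASCII.
# Expects a string of Latin-1 characters, one character per byte.
sub strip_latin1_accents
{
	my( $text ) = @_;
	$text =~ tr/\xC0-\xC5/A/;        # A with grave .. A with ring
	$text =~ tr/\xC7/C/;             # C with cedilla
	$text =~ tr/\xC8-\xCB/E/;
	$text =~ tr/\xCC-\xCF/I/;
	$text =~ tr/\xD1/N/;             # N with tilde
	$text =~ tr/\xD2-\xD6\xD8/O/;    # includes O with stroke
	$text =~ tr/\xD9-\xDC/U/;
	$text =~ tr/\xDD/Y/;
	$text =~ tr/\xE0-\xE5/a/;
	$text =~ tr/\xE7/c/;
	$text =~ tr/\xE8-\xEB/e/;
	$text =~ tr/\xEC-\xEF/i/;
	$text =~ tr/\xF1/n/;
	$text =~ tr/\xF2-\xF6\xF8/o/;
	$text =~ tr/\xF9-\xFC/u/;
	$text =~ tr/\xFD\xFF/y/;
	return $text;
}
```

In extract_words this would need to run before the step that replaces non alphanumeric characters with spaces, since that step would otherwise split a word at each accented letter.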


On Tue, Mar 20, 2001 at 02:39:58PM -0500, Clayton Carter wrote:
> Hi All,
> 
> 	I've a few questions and/or concerns related to diacritics.
> Primarily, I'm concerned with how they're handled for the database,
> search and display mechanisms.
> 
> 	Currently, it looks as though the characters put into the HTML
> form are copied directly into the database and then displayed `as is'
> in the HTML.  I say `as is' because they aren't translated into any
> funny looking HTML escape sequences like &#214; .  Also, when
> searching for words containing accents, one must specify the exact
> character sequence (complete with accents) in order to score a hit.
> 
> 	I can certainly see how/why this is considered correct
> behavior, but I was wondering if there was any more mature
> implementation planned?  I've been told that this is some sort of
> outstanding problem in library databases, but that the usual way to
> deal with this problem is to put a nonprinting character in front of
> the accented character that tells the system a) that the next
> character is an accented character and b) what a valid substitute
> would be.  So, as I understand it, the database would store something
> like `Goödel' and searches would be able to match both `Godel' and
> `Gödel', but the webpages would always be able to display `Gödel'.
> (Forgive me if these characters don't come out right.  I'm not
> accustomed to having to use them.)
> 
> 	That said, is there anything of that sort (or something else,
> perhaps WAY better than I could possibly ever think of) in the works?
> Does anyone running the software have any kind of policy toward
> accented characters or special instructions for users utilizing these
> characters?  Does anyone have any suggestions about how to better deal
> with this issue?
> 
> 	Thanks!  Let me know if I've not made myself clear.
> 
> --pc
> 
> PS - I also want to thank Chris for the braindump.  The background
> info is interesting and I'm excited at some of the features being
> mulled over and worked on.
> 
> -- 
> Clayton Carter   crcarter AT cs.indiana.edu
> "My mom says I'm the handsomest guy in school."

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

[EP-tech] Accented Characters

From: Clayton Carter <crcarter AT cs.indiana.edu>
Date: Tue, 20 Mar 2001 14:39:58 -0500




Hi All,

	I've a few questions and/or concerns related to diacritics.
Primarily, I'm concerned with how they're handled for the database,
search and display mechanisms.

	Currently, it looks as though the characters put into the HTML
form are copied directly into the database and then displayed `as is'
in the HTML.  I say `as is' because they aren't translated into any
funny looking HTML escape sequences like &#214; .  Also, when
searching for words containing accents, one must specify the exact
character sequence (complete with accents) in order to score a hit.

	I can certainly see how/why this is considered correct
behavior, but I was wondering if there was any more mature
implementation planned?  I've been told that this is some sort of
outstanding problem in library databases, but that the usual way to
deal with this problem is to put a nonprinting character in front of
the accented character that tells the system a) that the next
character is an accented character and b) what a valid substitute
would be.  So, as I understand it, the database would store something
like `Goödel' and searches would be able to match both `Godel' and
`Gödel', but the webpages would always be able to display `Gödel'.
(Forgive me if these characters don't come out right.  I'm not
accustomed to having to use them.)

	That said, is there anything of that sort (or something else,
perhaps WAY better than I could possibly ever think of) in the works?
Does anyone running the software have any kind of policy toward
accented characters or special instructions for users utilizing these
characters?  Does anyone have any suggestions about how to better deal
with this issue?

	Thanks!  Let me know if I've not made myself clear.

--pc

PS - I also want to thank Chris for the braindump.  The background
info is interesting and I'm excited at some of the features being
mulled over and worked on.

-- 
Clayton Carter   crcarter AT cs.indiana.edu
"My mom says I'm the handsomest guy in school."

[EP-tech] ePrints and LDAP Authentication

From: William Nixon <W.Nixon AT lib.gla.ac.uk>
Date: Tue, 20 Mar 2001 16:17:52 -0000




Chris hi,

Thanks for the very useful overview of ePrints, including the back history.
Glad that you have garnered so much interest.

Here at the University of Glasgow we have installed ePrints v.1.0 and are
looking at the possibilities which it offers us. 

One [currently missing] element which we are very interested in though is
the possibility of ePrints using the directory protocol LDAP to authenticate
users via our existing user databases on campus. Either that or other
alternatives which would allow them to use existing usernames and passwords
rather than creating new ones - and so adding to the very long list of
logins which they already use.

Is this something which other institutions would also be interested in? How
critical do they see the issue of authentication? How do they currently deal
with their ePrints authentication?

William J Nixon

==
William J Nixon, Deputy Head of IT Services / Project Co-ordinator
Glasgow University Library, Hillhead Street, Glasgow, G12 8QE, Scotland, UK
e-mail:	w.j.nixon AT lib.gla.ac.uk	www:
http://www.lib.gla.ac.uk/staff/wnixon
tel:	+44 (0)141 330 6721	fax:	+44 (0)141 330 4952

[EP-tech] State of the system

From: ePrints Support <support AT eprints.org>
Date: Wed, 7 Mar 2001 18:46:56 +0000




Hi, and welcome to the list.

There are about 50 people signed up! I'm quite surprised (and pleased).

Here comes the brain dump. Feel free to comment and ask questions
but PLEASE DON'T QUOTE THIS WHOLE MESSAGE - it's really long - just
quote the bit you're interested in.


<BRAINDUMP BEGINS>

back history:

eprints.org 1.0 was written by Rob Tansley who has since left Southampton
University. 

Since then I, Christopher Gutteridge, have been looking after the code.

I familiarised myself with the system and made a few minor changes to it;
this was eprints 1.1.

Now I'm working on adding my expertise to the system - some things were
OK but could be much better.

From the point of view of design and time, the biggest change is changing
the SQL back end. Previously the database stored each record as a single
row in a table in the database. Names were stored as a colon (:) separated
list in a VARCHAR(255) - if the names of all the authors of a record came to
more than 255 characters, problems would occur.

The solution to this is to allow ANY field to be multiple, in which 
case it is stored in a separate table rather than the main table. For 
example in the default configuration, subjects are now stored in a 
separate table:

+----------+------------------+
| Field    | Type             |
+----------+------------------+
| eprintid | varchar(255)     |
| pos      | int(10) unsigned |
| subjects | varchar(255)     |
+----------+------------------+

pos isn't relevant to all multiple types, but for authors the order
really matters.
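
As an illustration of how a multiple field maps onto such a table, the sketch below turns a list of values into one row per value. The helper name is hypothetical, and numbering pos from 0 is an assumption; the real code may count from 1.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the values of a multiple field into rows for
# its side table, one row per value, with "pos" recording the order.
# Each row is [ eprintid, pos, value ].
sub rows_for_multiple_field
{
	my( $eprintid, $values ) = @_;
	my $pos = 0;
	return map { [ $eprintid, $pos++, $_ ] } @{$values};
}
```

Reading the rows back ordered by pos restores the original order, which is what keeps an author list in the right sequence.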

Names are now stored as two columns: family & given - these are the 
most locale-neutral descriptions I could find - they could still change.

The "indexed" option is no longer an option. Everything will be indexed.
This will speed up searches and slow down updates - which I think is fine.

URL, EMAIL, TEXT and MULTITEXT fields will have all the identifiable words
indexed in a separate table, so that searching for records which contain
"foo" just looks it up in this table. MySQL's own freetext searching is
not yet mature enough - maybe we will be able to use it later.

Identifying the words will be done by a function in SiteRoutines so that
you can tweak it for your own needs. This function also returns a list
of words which were ignored (too short or too common eg. "a", "the" ).

ALL access to the database is now done via the search expression module
(except complete dumps). There is now no way to retrieve just certain columns
from the database - you get the whole record.

I'm considering a "map table" method which will create an object for
every record in a table (archive, users, subscriptions) and then
apply a passed in subroutine to it - so that scripts which process
the whole data don't have to load it all into memory at once.
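
One way the "map table" idea could look, sketched in Python with SQLite
standing in for MySQL. The function name map_table and the batch size are
assumptions; the point is only that rows are fetched a batch at a time
and handed to the callback one by one.

```python
import sqlite3

def map_table(conn, table, fn, batch_size=100):
    """Call fn(record) for every row of table; return the number of rows seen."""
    # table is pasted into the SQL, so it must come from trusted
    # configuration, never from user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    count = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            fn(dict(zip(columns, row)))  # one record at a time
            count += 1
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("cjg", "cjg@example.org"), ("swh", "swh@example.org")],
)

seen = []
processed = map_table(conn, "users", lambda rec: seen.append(rec["username"]))
```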

Searching got a whole lot more complex with N tables to search for a
given request. I tried doing the whole lot in one go by generating SQL
requests as long as needed, but the MySQL optimiser didn't do a very
good job, so now it performs the search part by part using temporary
tables - which is a performance hit on small databases, but an improvement
on large ones. The system now also uses EXPLAIN, which is virtually a free
function (in terms of speed), to decide which order to do the search in.
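
A simplified sketch of the part-by-part approach, in Python with SQLite
standing in for MySQL. The table names, the (table, column, value) clause
format and the hits_N temporary tables are all invented for the example,
and the EXPLAIN-based ordering is left out: clauses run in the order
given, each one narrowing the previous set of ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE archive (eprintid TEXT PRIMARY KEY, year INTEGER);
CREATE TABLE archive_subjects (eprintid TEXT, pos INTEGER, subjects TEXT);
INSERT INTO archive VALUES ('demo1', 1952), ('demo2', 1999), ('demo3', 1952);
INSERT INTO archive_subjects VALUES
  ('demo1', 0, 'arts-flms'), ('demo2', 0, 'arts-flms'), ('demo3', 0, 'arts-fnar');
""")

def search(conn, clauses):
    """Return the ids matching every (table, column, value) clause."""
    # Table and column names are pasted into the SQL, so they must come
    # from trusted configuration; only the value is a bound parameter.
    conn.execute("CREATE TEMP TABLE hits_0 AS SELECT eprintid FROM archive")
    for n, (table, column, value) in enumerate(clauses, start=1):
        conn.execute(f"CREATE TEMP TABLE hits_{n} (eprintid TEXT)")
        conn.execute(
            f"INSERT INTO hits_{n}"
            f" SELECT DISTINCT t.eprintid FROM {table} AS t"
            f" JOIN hits_{n - 1} AS h ON h.eprintid = t.eprintid"
            f" WHERE t.{column} = ?",
            (value,),
        )
        conn.execute(f"DROP TABLE hits_{n - 1}")
    last = f"hits_{len(clauses)}"
    ids = [row[0] for row in
           conn.execute(f"SELECT eprintid FROM {last} ORDER BY eprintid")]
    conn.execute(f"DROP TABLE {last}")
    return ids

found = search(conn, [("archive", "year", 1952),
                      ("archive_subjects", "subjects", "arts-flms")])
```

Running the most selective clause first keeps the intermediate tables
small, which is where an estimate from EXPLAIN would come in.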

For large result sets just reading all the data from disk takes some time
(I'm testing on my desktop machine, which is also running X, netscape and an mp3
player, so it won't be my final benchmarking system). There is a point
at which I have a list of all the ids of records to retrieve but haven't fetched
them yet - at this point, if this list is longer than an admin-configurable
limit (say 1000), it will just return the first 1000 off the pile, unsorted. It
will clearly warn you that it has done this. I will probably make this
optional - but if someone searches for, say, all records before 2020 it would
dump the entire DB - possibly very big, and time consuming.
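
The cut-off step might look something like this in Python. The function
name and the wording of the warning are assumptions; the limit is the
admin-configurable value mentioned above.

```python
def cap_results(ids, limit=1000):
    """Return (ids_to_fetch, truncated) for a list of matching record ids."""
    if len(ids) <= limit:
        return list(ids), False
    # Too many matches: keep the first `limit` ids as they come, unsorted.
    return list(ids[:limit]), True

all_ids = [f"demo{n}" for n in range(1500)]
ids, truncated = cap_results(all_ids, limit=1000)
if truncated:
    warning = f"Only the first {len(ids)} matches are shown, unsorted."
```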

For reference my tests so far have been done on an eprints database of 
50000 records.

I'm planning to combine/remove and create some datatypes.

roughly

"subjects" is renamed to "subject" (may be multiple of course)
"multiurl" goes. Use url, multiple=yes instead.
"multitext" gets a better name, not sure what yet.
"enum" goes. It is just a non-multiple set and will be treated as such.

New fields: (Names are working titles only)
"textwithid" combination of a text field and an "ID" field which is
            effectively just another text field, but strictly associated
            with this text field - eg. You may want a textwithid field
            called book where "text" is the bookname and "id" is the ISBN.
"namewithid" same concept but for people. I want to be able to uniquely
            identify people in the system - I'm not recommending an ID system,
            but using the eprints username will work OK - the "editor" can
            create non-login "user" entries for people not already in the
            system.

I'm also considering making a system to create user views like subject view
so people can link to their publications. I would also like to generate
non-wrapped HTML versions of the views so these can be harvested into info
pages on users. See http://www.ecs.soton.ac.uk/info/people/swh - the list
of publications is imported from our current database (not eprints, but
we will move over as part of the testing for this new eprints version).

The goal of the id fields is to be ready for the future when things
will be more cross referenced. In a world of 6 billion+ people, just
a name doesn't identify you for sure anymore. This isn't a solved problem
but we want to get closer.

For people, you may want to identify them on a record by name and 
local eprints username, and then store one, or more!, ways of identifying
them as part of the user record.

Another thing: a way to import an old archive as XML:
<RECORD>
      <TEXT field="eprintid">demo1</TEXT>
      <TEXT field="username">cjg</TEXT>
      <TEXT field="title">Title of paper (#1)</TEXT>
      <EPRINTTYPE field="type">confpaper</EPRINTTYPE>
      <YEAR field="year">1952</YEAR>
      <MULTITEXT field="abstract">da abstract</MULTITEXT>
      <TEXT field="conference">conference goes here!</TEXT>
      <TEXT field="succeeds"></TEXT>
      <TEXT field="commentary"></TEXT>
      <NAME field="authors">
        <FAMILY>Polden</FAMILY>
        <GIVEN>Neil</GIVEN>
      </NAME>
      <NAME field="editors">
        <FAMILY>Schilhabel</FAMILY>
        <GIVEN>Jude</GIVEN>
      </NAME>
      <NAME field="editors">
        <FAMILY>Rosinger</FAMILY>
        <GIVEN>James</GIVEN>
      </NAME>
      <SUBJECTS field="subjects">arts-flms</SUBJECTS>
      <SUBJECTS field="subjects">arts-fnar</SUBJECTS>
</RECORD>

This is an example of XML which is recognised by my development version. The
idea is to make it easy to (a) transfer from another system to using eprints,
and (b) to make it easy to populate an eprints system with a whole load of
data for testing purposes.
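
To show how a record in this shape can be read back, here is a small
Python sketch using the standard library XML parser. It is not the
eprints importer; the function name and the choice to return every field
as a list (so repeated fields such as editors and subjects keep all their
values) are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_record(xml_text):
    """Turn one <RECORD> into {field: [values]}; names become (family, given)."""
    record = defaultdict(list)
    for element in ET.fromstring(xml_text):
        field = element.get("field")
        if element.tag == "NAME":
            value = (element.findtext("FAMILY", ""),
                     element.findtext("GIVEN", ""))
        else:
            value = (element.text or "").strip()
        record[field].append(value)  # repeated fields accumulate in order
    return dict(record)

sample = """<RECORD>
  <TEXT field="eprintid">demo1</TEXT>
  <TEXT field="succeeds"></TEXT>
  <NAME field="editors"><FAMILY>Schilhabel</FAMILY><GIVEN>Jude</GIVEN></NAME>
  <NAME field="editors"><FAMILY>Rosinger</FAMILY><GIVEN>James</GIVEN></NAME>
  <SUBJECTS field="subjects">arts-flms</SUBJECTS>
  <SUBJECTS field="subjects">arts-fnar</SUBJECTS>
</RECORD>"""
record = parse_record(sample)
```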

The next patch version of 1.1 (there are still a few minor tweaks which
could be made, and it's OAI 1.0 compliant but not *robustly* compliant -
don't worry - it just doesn't 404 when it should, which doesn't matter in
day to day use). Anyway, the next patch will include a script which will
dump a 1.1 eprints archive into the above XML format (or the finalised
version) so that when/if you upgrade to 1.2 (or 2.0?) you can just import
the old data. This is the best way I could think of for moving between the
two structures, although I don't want people using this for anything else,
as non-standard export methods only cause trouble long term.

--

Thinking about mirroring methods - currently I'm thinking about
using MySQL's own mirroring system & rdist...

--

"Subscriptions" for staff to the Submissions Buffer so they get updated
when certain things come in.

---

Internationalisation:

This is a biggy and a lot of people care about it. 

I've made a first stab at this but have delayed it until the bulk of
the code re-design is done, or people will have to translate things which
I then go and change or delete altogether.

Currently the plan is to make a config file for each language to translate
the code. A cookie can control which language is shown.
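
A sketch of the cookie-driven lookup, in Python for illustration only.
The cookie name "lang", the phrase keys and the in-memory PHRASES table
(standing in for one config file per language) are all assumptions.

```python
from http.cookies import SimpleCookie

# Stand-in for the per-language config files; contents are made up.
PHRASES = {
    "en": {"search": "Search", "help": "Help"},
    "fr": {"search": "Rechercher"},
}
DEFAULT_LANG = "en"

def choose_language(cookie_header):
    """Pick a language from the Cookie header, falling back to the default."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("lang")  # the cookie name is an assumption
    if morsel is not None and morsel.value in PHRASES:
        return morsel.value
    return DEFAULT_LANG

def phrase(lang, key):
    """Look a phrase up, using the default language for untranslated keys."""
    return PHRASES[lang].get(key, PHRASES[DEFAULT_LANG][key])

lang = choose_language("lang=fr; other=1")
label = phrase(lang, "search")
```

Falling back to the default language for missing keys means a partial
translation still produces a usable page.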

This same cookie can be used with mod_rewrite to display a different static
directory depending on the language.

The "help" in the config file for each field can be set for each language -
and the SiteInfo file will need to have internationalisation changes - I've
not worried too much about this yet.

The Subjects list presents some problems. I'm currently planning NOT to
provide a way of making this available in any language other than that
of the archive - which is the language most of the data will be in anyway.

---

Look and feel

I plan to "tart up" the default look of an eprints archive and use
more stylesheets and fewer <CENTER> tags. Other than that I've not been
thinking about this too much as most people will largely change it.

---

I'd like a way to make some formats "Private" eg. You make the PDF and
Postscript public but only certain people can download the original LaTeX.

I've got a few ideas - one is to put in a .htaccess file to control Apache,
the other is to PGP encrypt the file. This is still very up in the air.

---

And that's more or less everything.

I've been, am, and expect to be, pretty busy.


I look forward to people's comments...


-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833

[EP-tech] First Message

From: ePrints Support <support AT eprints.org>
Date: Tue, 6 Mar 2001 17:46:38 +0000




Welcome to the eprints.org technical list.

-- 

 Christopher Gutteridge                   support AT eprints.org 
 ePrints Technical Support                +44 23 8059 4833
