See the Mailing Lists Page for how to subscribe and unsubscribe.
eprints_tech messages
Please note: this page shows emails that have been sent to the eprints_tech mailing list. Some of these may be spam emails we have failed to filter.
Re: [EP-tech] security hole and expensive indexing
From: Tim Brody <tdb01r AT ecs.soton.ac.uk>
Date: Tue, 20 May 2008 10:33:27 +0100
| Threading: | ↑ [EP-tech] security hole and expensive indexing from roman.chyla AT gmail.com • This Message |
*** http://www.eprints.org/tech.php/id/%3C1211276007.8268.130.camel%40meteor%3E *** EPrints community wiki - http://wiki.eprints.org/ On Mon, 2008-05-19 at 17:38 +0200, Roman Chyla wrote: > *** ↵ http://www.eprints.org/tech.php/id/%3Cea0115e90805190838l99d250cs77fac5d279883758%40mail.gmail.com%3E > *** EPrints community wiki - http://wiki.eprints.org/ > > Hi Tim, > > > > > On Mon, 2008-05-19 at 14:47 +0200, Roman Chyla wrote: > >> *** ↵ http://www.eprints.org/tech.php/id/%3Cea0115e90805190547o4d84bdb2yec2625bbaf255415%40mail.gmail.com%3E > >> *** EPrints community wiki - http://wiki.eprints.org/ > >> > >> Hi, > >> i am in a process of fixing EPrints for us. I am hoping to give ↵ you > >> some feedback over time. This one is a severe security hole: > >> generally a very bad idea to do: > >> > >> $sql = "INSERT INTO $indextable (fieldword,pos,ids ) VALUES > >> ('$fieldword',$n,':$objectid:')" > >> $session->get_database->do( $sql ) > >> > >> should be done as: > >> > >> $sql = "INSERT INTO $indextable (fieldword,pos,ids ) VALUES ↵ (?,?,?)" > >> $session->get_database->do( $sql, $fieldword, $n, $objectid ↵ ) > >> > >> even if EPrints *sometimes* does the Database::prep_value($value) ↵ - > >> taht function does not care for anything but backslashes > > > > $fieldword goes through prep_value, which escapes quotes and > > backslashes: > > I think it is not the case here: > "UPDATE $indextable SET ids='$ids$objectid:' WHERE > fieldword='$fieldword' AND pos=$n" Goes through prep_value at the top of that code block? (3.0.5) > > $value =~ s/["\\']/\\$&/g; > > I am not sure if this is enough - there might be more to escape than > quotes and back slashes. And escaping is database/driver specific, so > in my opinion it is better to use the standard DBI quote function > $dbh->quote("Don't") Reading the MySQL manual NUL, \r, \n and EOF should also be escaped, which would be trivial to add. But I've not hit any problems with them (at worst you would cause a SQL error, but then you'd have to be using something weird to submit NUL and EOF anyway ...). 3.0.x only works with MySQL, so we don't have to worry about other databases. 3.1.x uses DBI->quote() and DBI->quote_identifier() > > You can't pass a value through prep_value (in MySQL) that will be > > executed. > > > > The only non-escaped values passed through should be integer i.e. ↵ will > > have come from an id field. > > > >> prone to this error: everything in EPrints, especially fulltext ↵ indexing > >> > >> ---- > >> and also, the indexing is very CPU intensive (for many reasons) ↵ one of > >> them is this: > >> > >> $session->get_database->do( $query) > >> > >> it would be much much better to do: > >> my $q = $dbh->prepare("statement") > >> > >> and then only in the loop > >> $q->execute() > > > > MySQL's driver only executes strings - the place holder substitution > > happens in Perl for every execution. prepare/execute and do() should ↵ be > > about the same complexity. > > considering that the add() is called for every field for every eprint, > it could make a difference to prepare statements once, cache them, and > use for every field. but yes, in theory and in some cases it is > necessary ->do(), because with DBI one cannot have placeholders for > table/column names, but If you want to benchmark you can replace do() with this: my $sth = $database->{dbh}->prepare_cached( "SQL WITH PLACE ↵ HOLDERS" ); $sth->execute( AT VALUES ); But unless the MySQL driver has changed from what I know I don't expect it to make any difference. > > I've rationalised a lot of the DB functionality in 3.1, including ↵ making > > use of DBI's place holder substitution, because that does make a > > difference to other DBs. > > > > I think you'll find the slowest parts of indexing are due to spawning > > external processes (for text extraction) and text parsing. > > of course, that counts, but i think some of these count too: > - combining field value with the field name in eprint__index (not in > eprint_rindex though) > that means when searching, you are probably searching for > _fulltext_:something instead of "select * from index where > col1=something and col2=fulltext_id" > nor can you spare all the rows for index that should match every > value, it is not possible to ask: "select * from index where > col1=something" > > i believe it is expensive to have that text field name in the value, > yet again, i have done no profiling I don't think EPrints ever looks for just the term - it's always a term in a specific field (ORed over multiple fields for a simple search). >From a database perspective I don't think it makes any difference whether you store two concated strings or an index over two columns. (The difference between an index using ':' as a separator and NUL) > - and instead of billion queries, you are issuing 2 billion queries > just updating the index > SELECT max(pos) FROM $indextable where fieldword='$fieldword' > SELECT ids FROM $indextable WHERE fieldword='$fieldword' AND pos=$n > > while it could be: > SELECT ids FROM $indextable WHERE fieldword='$fieldword' order by pos > desc limit 1; Yup, that looks like a plausable optimisation for MySQL. But LIMIT isn't supported by Oracle, so it would need some work. I'm not hugely happy with the indexing code, but I don't think (?) it's something that has limited people to date with scaling. 3.1 abstracts the database. 3.2 will abstract the storage layer. Perhaps in 3.3 we'll abstract the indexing engine and you can use something else! All the best, Tim.
[index] [prev] [next] [options] [help]




