EPrints Technical Mailing List Archive

Message: #07650



Re: [EP-tech] hashes in EPrints


Hi Tomasz,

While doing some other work, I thought I should check what you had asked about:

mysql> select distinct hash_type from file ;
+-----------+
| hash_type |
+-----------+
| NULL      |
| MD5       |
+-----------+
2 rows in set (0.78 sec)

Interestingly, we do have some dataobj.xml files with hashes (fileid 4 and 6 below):

mysql> select * from file limit 10;
+--------+-----------+----------+-------------------------+-----------------+----------------------------------+-----------+----------+------------+-------------+-----------+------------+--------------+--------------+
| fileid | datasetid | objectid | filename                | mime_type       | hash                             | hash_type | filesize | mtime_year | mtime_month | mtime_day | mtime_hour | mtime_minute | mtime_second |
+--------+-----------+----------+-------------------------+-----------------+----------------------------------+-----------+----------+------------+-------------+-----------+------------+--------------+--------------+
|      1 | history   |      759 | dataobj.xml             | text/xml        | NULL                             | NULL      |     2905 |       2014 |          11 |        18 |          3 |           10 |            3 |
|      2 | history   |      760 | dataobj.xml             | text/xml        | NULL                             | NULL      |     2906 |       2014 |          11 |        18 |          3 |           10 |            3 |
|      3 | history   |    11067 | dataobj.xml             | text/xml        | NULL                             | NULL      |     2898 |       2014 |          11 |        18 |          3 |           10 |            3 |
|      4 | history   |   237035 | dataobj.xml             | text/xml        | 190a918c2b50c4fffadf14b4cbafc356 | MD5       |     2980 |       2014 |          11 |        18 |          3 |           10 |            5 |
| 357331 | document  |    89290 | lightbox.jpg            | image/png       | f17591e6b90990ef577887dcf43ba677 | MD5       |    49089 |       2016 |           3 |        11 |          6 |           43 |           56 |
|      6 | history   |   237036 | dataobj.xml             | text/xml        | 681523cd92a6bd976fa0437cc6238629 | MD5       |     2589 |       2014 |          11 |        18 |          3 |           10 |            5 |
|      7 | document  |      286 | Broadband_published.pdf | application/pdf | NULL                             | NULL      |   439281 |       2014 |          11 |        18 |          3 |           10 |            5 |
|      8 | history   |      761 | dataobj.xml             | text/xml        | NULL                             | NULL      |     3075 |       2014 |          11 |        18 |          3 |           10 |            5 |
|      9 | history   |      762 | dataobj.xml             | text/xml        | NULL                             | NULL      |     3076 |       2014 |          11 |        18 |          3 |           10 |            5 |
|     10 | history   |    11068 | dataobj.xml             | text/xml        | NULL                             | NULL      |     3060 |       2014 |          11 |        18 |          3 |           10 |            5 |
+--------+-----------+----------+-------------------------+-----------------+----------------------------------+-----------+----------+------------+-------------+-----------+------------+--------------+--------------+
10 rows in set (0.00 sec)

I can’t explain (and haven’t looked into) what causes this behaviour, but hashes appear to be applied just as inconsistently in our database: some dataobj.xml and document rows have an MD5 hash, while others have NULL for both hash and hash_type.
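
To see how uneven this is per dataset, a grouped count is probably more telling than the first ten rows. Below is a minimal sketch in Python. It runs against an in-memory SQLite table filled with made-up rows that stand in for the real MySQL `file` table, so SAMPLE_ROWS and hash_coverage are hypothetical names, not anything from EPrints. The GROUP BY query itself is plain SQL and should run unchanged in the mysql client.

```python
import sqlite3

# Hypothetical sample rows (datasetid, filename, hash, hash_type) standing in
# for the real `file` table; they only mirror the kinds of rows shown above.
SAMPLE_ROWS = [
    ("history", "dataobj.xml", None, None),
    ("history", "dataobj.xml", "190a918c2b50c4fffadf14b4cbafc356", "MD5"),
    ("document", "lightbox.jpg", "f17591e6b90990ef577887dcf43ba677", "MD5"),
    ("document", "Broadband_published.pdf", None, None),
    ("history", "dataobj.xml", None, None),
]

# Count rows per dataset and hash type; NULL hash_type forms its own group.
COVERAGE_QUERY = """
    SELECT datasetid, hash_type, COUNT(*) AS n
    FROM file
    GROUP BY datasetid, hash_type
    ORDER BY datasetid, hash_type
"""


def hash_coverage(rows):
    """Load rows into a throwaway table and return (datasetid, hash_type, n) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE file (datasetid TEXT, filename TEXT, hash TEXT, hash_type TEXT)"
        )
        conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?)", rows)
        return conn.execute(COVERAGE_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for datasetid, hash_type, n in hash_coverage(SAMPLE_ROWS):
        print(datasetid, hash_type, n)
```

Rows where hash_type comes back as NULL (None in Python) are the ones with no stored hash.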

karl.

On 11 Jan 2019, at 9:25 am, Tomasz Neugebauer via Eprints-tech <eprints-tech@ecs.soton.ac.uk> wrote:

Happy New Year! 
 
I am reaching out to the list about this issue of MD5 vs SHA256 hashes in EPrints.
 
Based on digging into our database:
- EPrints generates file.hash values with file.hash_type MD5 for the following derivative files:
  - indexcodes.txt
  - lightbox.jpg
  - preview.jpg
  - medium.jpg
  - small.jpg
- EPrints generated file.hash values with hash_type MD5 for most (but not all) of the uploaded files (such as PDF files). Some of the older PDF files do not have a file.hash value stored in our database.
- EPrints did not generate any file.hash values for the dataobj.xml files for the history objects.
 
The EPrints source code includes a function to generate MD5 and SHA256 hashes, but it looks like only MD5 is ever used by default.
 
Are these findings consistent with what you have in your EPrints instance?
Since we have MD5 by default in EPrints, do you agree that MD5 will be sufficient for the export to Archivematica?
Does anyone know why some of our uploaded files would have no file.hash? Could that have been caused by a bug in EPrints that prevented hashes from being generated, but that was resolved at some point?
 
Tomasz
 
 
 
________________________________________________

Tomasz Neugebauer
Digital Projects & Systems Development Librarian / Bibliothécaire des Projets Numériques & Développement de Systèmes
Library / Bibliothèque
Concordia University / Université Concordia

Tel. / Tél. 514-848-2424 ext. / poste 7738
Email / courriel: tomasz.neugebauer@concordia.ca

www.concordia.ca/faculty/tomasz-neugebauer.html

Mailing address / adresse postale: 1455 De Maisonneuve Blvd. W., LB-540-03, Montreal, Quebec H3G 1M8
Street address / adresse municipale: 1400 De Maisonneuve Blvd. W., LB-540-03, Montreal, Quebec H3G 1M8

library.concordia.ca

 
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/

-- 
Karl Goetz,  Senior Library Officer (Library Systems)
University of Tasmania, Private Bag 25, Hobart 7001
Available Tuesday, Wednesday, Thursday


