
[EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard




Dear all

Unfortunately, one of our partner crawlers reports a 404 error during download. The problem occurs when a wildcard is used as the MIME subtype in the Accept header.

Here is an example from our repository ZORA. Let us try to fetch publication no. 143147 via curl:

HTTP 200 status is returned, when
- no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
- an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
- any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/

HTTP 404 status is returned if the MIME subtype is a wildcard, e.g. 'text/*'.

==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/

[...]
< HTTP/1.1 404 Not Found
< Date: Mon, 26 Jul 2021 08:23:04 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
< Cache-Control: no-store, no-cache, must-revalidate
< Strict-Transport-Security: max-age=15780000
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8

The header "Accept: text/*,application/*" should be valid, so we think something is going wrong around CRUD.pm [line 948] - elsif( $subtype eq '*' ) {}
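
To illustrate the behaviour we would expect, here is a minimal Python sketch of Accept-header matching with subtype wildcards. It is not the EPrints code from CRUD.pm; the function names (parse_accept, matches, negotiate) are hypothetical, and the sketch is simplified: it orders media ranges by q-value only and ignores the specificity precedence rules of the HTTP specification.

```python
def parse_accept(header):
    """Split an Accept header into (type, subtype, q) tuples, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0]
        if "/" not in media:
            continue  # skip empty or malformed entries
        mtype, subtype = media.split("/", 1)
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    pass  # assumption: an unparsable q keeps the default
        ranges.append((mtype.lower(), subtype.lower(), q))
    # sorted() is stable, so ranges with equal q keep their header order
    return sorted(ranges, key=lambda r: r[2], reverse=True)


def matches(mtype, subtype, offered):
    """Return True if the media range mtype/subtype covers the offered type."""
    otype, osubtype = offered.lower().split("/", 1)
    if mtype == "*" and subtype == "*":
        return True
    if mtype != otype:
        return False
    # 'text/*' must match every subtype of 'text'
    return subtype == "*" or subtype == osubtype


def negotiate(header, offered_types):
    """Pick the first offered type accepted by the header, or None."""
    if not header or not header.strip():
        # no Accept header: the client accepts anything
        return offered_types[0] if offered_types else None
    for mtype, subtype, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for offered in offered_types:
            if matches(mtype, subtype, offered):
                return offered
    return None


print(negotiate("text/*,application/*", ["text/html"]))
```

Under these assumptions a server that can serve text/html should select it for "Accept: text/*,application/*" and answer with 200 rather than 404.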

Is this a bug or is there a workaround? Any help is appreciated.

Have a nice day
Jens


--
Jens Witzel
Zentrale Informatik
Universität Zürich
Stampfenbachstrasse 73
CH-8006 Zürich

mail:  jens.witzel at uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch/