
[EP-tech] Reply: Re: Crawler ends up with 404, dont know how to handle MIME subtype wildcard



CAUTION: This e-mail originated outside the University of Southampton.

Dear David

Thank you for your support!

Kind regards
Jens

--
Jens Witzel
Zentrale Informatik
Universität Zürich
Stampfenbachstrasse 73
CH-8006 Zürich

mail:  jens.witzel at uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch/


From: "David R Newman" <drn at ecs.soton.ac.uk>
To: eprints-tech at ecs.soton.ac.uk, jens.witzel at uzh.ch
Date: 26.07.2021 10:50
Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard

________________________________



Hi Jens,

I can replicate the same problem on 3.4 GitHub HEAD [1].  I have created a GitHub issue for this [2] and will investigate.

Regards

David Newman

[1] https://github.com/eprints/eprints3.4

[2] https://github.com/eprints/eprints3.4/issues/159

On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:


Dear all

Unfortunately, one of our partner crawlers reports a 404 error during download. The problem occurs when a wildcard is used as the MIME subtype.

Here is an example from our repository ZORA. Let us try to fetch publication no. 143147 via curl:

HTTP 200 status is returned when
- no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
- an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
- any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/

HTTP 404 status is returned if the MIME subtype is a wildcard, e.g. 'text/*'.

==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/

[...]
< HTTP/1.1 404 Not Found
< Date: Mon, 26 Jul 2021 08:23:04 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
< Cache-Control: no-store, no-cache, must-revalidate
< Strict-Transport-Security: max-age=15780000
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8
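
For anyone who wants to re-run the four cases in one go, here is a minimal sketch in Python using only the standard library. The URL and the Accept values are the ones from the curl calls above; the function names are made up for this example. One difference to keep in mind: curl sends "Accept: */*" by default, whereas urllib sends no Accept header unless one is added, so the first case here really is a request without an Accept header.

```python
import urllib.error
import urllib.request

# URL and Accept values are taken from the curl calls in this report.
URL = "https://www.zora.uzh.ch/id/eprint/143147/"
ACCEPT_VALUES = [None, "text/html", "*/*", "text/*,application/*"]


def build_request(url, accept):
    """Build a GET request, adding an Accept header only when one is given."""
    request = urllib.request.Request(url)
    if accept is not None:
        request.add_header("Accept", accept)
    return request


def fetch_status(request):
    """Return the HTTP status code, including error statuses such as 404."""
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code


def check_all(url=URL, accept_values=ACCEPT_VALUES):
    """Print one line per Accept value with the status the server returned."""
    for accept in accept_values:
        status = fetch_status(build_request(url, accept))
        print(f"Accept: {accept!r:30} -> HTTP {status}")
```

Calling check_all() prints one line per Accept value, which makes it easy to compare the status codes before and after a fix.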

The header "Accept: text/*,application/*" should be valid, so we think something is going wrong around CRUD.pm [line 948]: elsif( $subtype eq '*' ) {}
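
To make the expected behaviour concrete, here is a small sketch in Python of how a media range with a wildcard subtype is normally matched against a concrete content type. This is not the code from CRUD.pm and not a proposed patch; all function names are made up for this example, and q-values and other parameters are ignored to keep it short.

```python
def parse_accept(header):
    """Split an Accept header into (type, subtype) pairs, dropping parameters such as q."""
    ranges = []
    for item in header.split(","):
        media_range = item.split(";", 1)[0].strip().lower()
        if "/" not in media_range:
            continue  # skip malformed entries
        main_type, subtype = media_range.split("/", 1)
        ranges.append((main_type, subtype))
    return ranges


def range_matches(media_range, content_type):
    """Return True if one Accept media range covers a concrete content type."""
    want_type, want_subtype = media_range
    have_type, have_subtype = content_type.lower().split("/", 1)
    if want_type == "*":
        # A wildcard type is only meaningful as the full wildcard */*.
        return want_subtype == "*"
    if want_type != have_type:
        return False
    # text/* covers every text subtype; otherwise the subtype must be equal.
    return want_subtype in ("*", have_subtype)


def acceptable(header, content_type):
    """Return True if any media range in the header covers content_type."""
    return any(range_matches(r, content_type) for r in parse_accept(header))
```

With this logic, acceptable("text/*,application/*", "text/html") is True, so a server that can deliver text/html would be expected to answer with 200 rather than 404.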

Is this a bug or is there a workaround? Any help is appreciated.

Have a nice day
Jens


--
Jens Witzel
Zentrale Informatik
Universität Zürich
Stampfenbachstrasse 73
CH-8006 Zürich

mail:  jens.witzel at uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch/

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/


