[EP-tech] Crawler ends up with 404, don't know how to handle MIME subtype wildcard



Hi Jens,

I can replicate the same problem on 3.4 GitHub HEAD [1]. I have created 
a GitHub issue for this [2] and will investigate.

Regards

David Newman

[1] https://github.com/eprints/eprints3.4

[2] https://github.com/eprints/eprints3.4/issues/159

On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:
> Dear all
>
> Unfortunately, one of our partner crawlers reports a 404 error during 
> download. The problem occurs when a wildcard is used as the MIME subtype.
>
> Here is an example from our repository ZORA. Let us try to fetch 
> publication no. 143147 via curl:
>
> HTTP 200 status is returned when
> - no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
> - an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
> - any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/
>
> HTTP 404 status is returned if the MIME subtype is a wildcard, e.g. 'text/*'.
>
> ==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/
>
> [...]
> < HTTP/1.1 404 Not Found
> < Date: Mon, 26 Jul 2021 08:23:04 GMT
> < Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
> < Cache-Control: no-store, no-cache, must-revalidate
> < Strict-Transport-Security: max-age=15780000
> < Transfer-Encoding: chunked
> < Content-Type: text/html; charset=utf-8
>
> The header "Accept: text/*,application/*" should be valid, so we 
> think something is going wrong around CRUD.pm [line 948]:
> elsif( $subtype eq '*' ) {}
>
> Is this a bug or is there a workaround? Any help is appreciated.
>
> Have a nice day
> Jens
>
>
> -- 
> Jens Witzel
> Zentrale Informatik
> Universität Zürich
> Stampfenbachstrasse 73
> CH-8006 Zürich
>
> mail: jens.witzel at uzh.ch
> phone: +41 44 63 56777
> http://www.zi.uzh.ch/
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
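
For readers following the thread: under HTTP content negotiation (RFC 7231, section 5.3.2), a media range such as 'text/*' covers every subtype of that type, so 'Accept: text/*,application/*' should be satisfiable by a text/html response. The following is only a minimal sketch of that matching rule in POSIX shell. The function name accept_matches is hypothetical, it is not EPrints code and not the fix tracked in the GitHub issue, and it deliberately ignores quality values and case differences.

```shell
#!/bin/sh
# Hypothetical helper, not part of EPrints: succeeds (exit status 0) if any
# media range in an Accept header value covers the given MIME type.
# Assumptions: parameters such as ";q=0.8" are ignored, and both arguments
# are expected in lower case.
accept_matches() {
    accept=$1
    mime=$2
    type=${mime%%/*}            # "text/html" -> "text"
    old_ifs=$IFS
    IFS=,                       # split the header value on commas
    set -f                      # stop '*' in the header from globbing
    for range in $accept; do
        range=${range%%;*}                            # drop parameters
        range=$(printf '%s' "$range" | tr -d ' \t')   # strip whitespace
        case $range in
            '*/*' | "$mime" | "$type/*")
                # any type, exact type, or subtype wildcard
                IFS=$old_ifs; set +f
                return 0 ;;
        esac
    done
    IFS=$old_ifs; set +f
    return 1
}

# The header from the failing request should match a text/html response.
accept_matches 'text/*,application/*' text/html && echo match
```

Run as written, the last line prints "match", which is the outcome the crawler expects from the repository for the 'text/*' case.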

