
[EP-tech] Reply: Re: Crawler ends up with 404, don't know how to handle MIME subtype wildcard



Hi Jens,

To fix your specific problem you need to modify 
perl_lib/EPrints/Apache/Rewrite.pm on or around line 422:

-                       &&  (index(lc($accept), "text/html") != -1 || index(lc($accept),"*/*") != -1 || $accept eq ""  )   ## header must be text/html, or */*, or undef
+                       &&  (index(lc($accept), "text/html") != -1 || index(lc($accept), "text/*") != -1 || index(lc($accept),"*/*") != -1 || $accept eq ""  )   ## header must be text/html, text/*, */* or undef

I am reviewing the implications of this change and whether any further 
changes are needed, as I see references to the Accept MIME type in 
several other places and want to check whether sending an Accept header 
of text/* on other requests would still break things.
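
As a rough illustration of the patched condition in isolation (a sketch 
only, not the actual Rewrite.pm code; the sub name accepts_html and the 
sample output labels are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the patched condition above: returns true when
# the Accept header is empty/undefined or mentions text/html, text/* or */*.
sub accepts_html
{
    my( $accept ) = @_;

    # Treat a missing header like an empty one.
    $accept = "" if !defined $accept;
    return 1 if $accept eq "";

    my $lc = lc( $accept );
    foreach my $type ( "text/html", "text/*", "*/*" )
    {
        # Plain substring match, the same approach as index() in the patch.
        return 1 if index( $lc, $type ) != -1;
    }
    return 0;
}

# The first four headers come from the curl requests in this thread;
# the last is an added non-matching case for contrast.
foreach my $header ( "", "text/html", "*/*", "text/*,application/*", "application/json" )
{
    printf "%-24s => %s\n", "'$header'", accepts_html( $header ) ? "matched" : "not matched";
}
```

Because this is a substring test, it ignores quality values, so a header 
such as text/*;q=0 would still match.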

Regards

David Newman

On 26/07/2021 09:55, jens.witzel at uzh.ch wrote:
> *CAUTION:* This e-mail originated outside the University of Southampton.
>
> Dear David
>
> thank you for your support!
>
> Kind regards
> Jens
>
> -- 
> Jens Witzel
> Zentrale Informatik
> Universität Zürich
> Stampfenbachstrasse 73
> CH-8006 Zürich
>
> mail: jens.witzel at uzh.ch
> phone: +41 44 63 56777
> http://www.zi.uzh.ch/
>
> From: "David R Newman" <drn at ecs.soton.ac.uk>
> To: eprints-tech at ecs.soton.ac.uk, jens.witzel at uzh.ch
> Date: 26.07.2021 10:50
> Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to 
> handle MIME subtype wildcard
>
> ------------------------------------------------------------------------
>
>
>
> Hi Jens,
>
> I can replicate the same problem on 3.4 GitHub HEAD [1]. I have 
> created a GitHub issue for this [2] and will investigate.
>
> Regards
>
> David Newman
>
> [1] https://github.com/eprints/eprints3.4
> [2] https://github.com/eprints/eprints3.4/issues/159
>
>
> On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:
>
>     Dear all
>
>     Unfortunately, one of our partner crawlers reports a 404 error
>     during download. The problem occurs when a wildcard is used as
>     the MIME subtype.
>
>     Here is an example from our repository ZORA. Let us try to fetch
>     publication no. 143147 via curl:
>
>     HTTP 200 status is returned when
>     - no Accept header is specified: curl -v
>     https://www.zora.uzh.ch/id/eprint/143147/
>     - an exact MIME type is specified: curl -v -H 'Accept: text/html'
>     https://www.zora.uzh.ch/id/eprint/143147/
>     - any MIME type is specified: curl -v -H 'Accept: */*'
>     https://www.zora.uzh.ch/id/eprint/143147/
>
>     HTTP 404 status is returned if the MIME subtype is open, e.g.
>     'text/*'.
>
>     ==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/
>
>     [...]
>     < HTTP/1.1 404 Not Found
>     < Date: Mon, 26 Jul 2021 08:23:04 GMT
>     < Server: Apache/2.4.6 (Red Hat Enterprise Linux)
>     OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
>     < Cache-Control: no-store, no-cache, must-revalidate
>     < Strict-Transport-Security: max-age=15780000
>     < Transfer-Encoding: chunked
>     < Content-Type: text/html; charset=utf-8
>
>     The header "Accept: text/*,application/*" should be valid, so we
>     think something is going wrong around CRUD.pm [line 948] -
>     elsif( $subtype eq '*' ) {}
>
>     Is this a bug or is there a workaround? Any help is appreciated.
>
>     Have a nice day
>     Jens
>

