[EP-tech] Re: Crawler ends up with 404, dont know how to handle MIME subtype wildcard
David R Newman
drn at ecs.soton.ac.uk
Mon Jul 26 15:03:24 BST 2021
Hi Jens,
To fix your specific problem you need to modify
perl_lib/EPrints/Apache/Rewrite.pm on or around line 422:
- && (index(lc($accept), "text/html") != -1 || index(lc($accept),"*/*") != -1 || $accept eq "" ) ## header must be text/html, or */*, or undef
+ && (index(lc($accept), "text/html") != -1 || index(lc($accept), "text/*") != -1 || index(lc($accept),"*/*") != -1 || $accept eq "" ) ## header must be text/html, text/*, */* or undef
I am reviewing the implications of this change and whether any further
changes are needed, as I see references to the Accept MIME type in
several other places and want to check whether setting the Accept MIME
type to text/* on other requests would still break things.
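As a rough illustration of the patched condition only, here is a small standalone sketch. It is not the actual Rewrite.pm code: the helper name accepts_html and the sample headers in the loop are hypothetical, chosen to mirror the cases reported in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EPrints: returns 1 when the Accept
# header would pass the patched check (text/html, text/*, */* or empty).
sub accepts_html
{
    my( $accept ) = @_;

    # Treat a missing header like an empty one.
    $accept = "" if !defined $accept;
    return 1 if $accept eq "";

    # Same case-insensitive substring test as the patched line.
    my $lc = lc( $accept );
    for my $type ( "text/html", "text/*", "*/*" )
    {
        return 1 if index( $lc, $type ) != -1;
    }
    return 0;
}

# Try the header values discussed in this thread.
for my $header ( "", "text/html", "*/*", "text/*,application/*", "application/json" )
{
    printf "%-24s => %s\n", "'$header'",
        accepts_html( $header ) ? "matched" : "not matched";
}
```

Note that a plain substring test like this ignores q-values, so a header such as "text/html;q=0" would still match.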
Regards
David Newman
On 26/07/2021 09:55, jens.witzel at uzh.ch wrote:
> Dear David
>
> thank you for your support!
>
> Kind regards
> Jens
>
> --
> Jens Witzel
> Zentrale Informatik
> Universität Zürich
> Stampfenbachstrasse 73
> CH-8006 Zürich
>
> mail: jens.witzel at uzh.ch
> phone: +41 44 63 56777
> http://www.zi.uzh.ch/
>
> From: "David R Newman" <drn at ecs.soton.ac.uk>
> To: eprints-tech at ecs.soton.ac.uk, jens.witzel at uzh.ch
> Date: 26.07.2021 10:50
> Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to
> handle MIME subtype wildcard
>
> ------------------------------------------------------------------------
>
>
>
> Hi Jens,
>
> I can replicate the same problem on 3.4 GitHub HEAD [1]. I have
> created a GitHub issue for this [2] and will investigate.
>
> Regards
>
> David Newman
>
> [1] https://github.com/eprints/eprints3.4
>
> [2] https://github.com/eprints/eprints3.4/issues/159
>
>
> On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:
>
> Dear all
>
> Unfortunately, one of our partner crawlers reports a 404 error
> during download. The problem occurs when a wildcard is used as the
> MIME subtype.
>
> Here is an example on our repository ZORA. Let us try to fetch
> publication no. 143147 via curl:
>
> HTTP 200 status is returned when
> - no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
> - an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
> - any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/
>
> HTTP 404 status is returned if the MIME subtype is a wildcard, e.g.
> 'text/*'.
>
> ==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/
>
> [...]
> < HTTP/1.1 404 Not Found
> < Date: Mon, 26 Jul 2021 08:23:04 GMT
> < Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
> < Cache-Control: no-store, no-cache, must-revalidate
> < Strict-Transport-Security: max-age=15780000
> < Transfer-Encoding: chunked
> < Content-Type: text/html; charset=utf-8
>
> The header "Accept: text/*,application/*" should be valid, so we
> think something is going wrong around CRUD.pm [line 948]:
> elsif( $subtype eq '*' ) {}
>
> Is this a bug or is there a workaround? Any help is appreciated.
>
> Have a nice day
> Jens
>
>
> --
> Jens Witzel
> Zentrale Informatik
> Universität Zürich
> Stampfenbachstrasse 73
> CH-8006 Zürich
>
> mail: jens.witzel at uzh.ch
> phone: +41 44 63 56777
> http://www.zi.uzh.ch/
>
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
>
>
>