
Help to handle brew livecheck from a non-UTF8 web page #11498

Closed
Yann-R opened this issue Jun 7, 2021 · 9 comments
Labels
features New features help wanted We want help addressing this outdated PR was locked due to age

Comments

@Yann-R
Contributor

Yann-R commented Jun 7, 2021

Provide a detailed description of the proposed feature

I'm not really proposing a feature or an evolution of brew livecheck: I'm rather looking for a workaround.

It's about processing (with a regex) a web page encoded not in UTF-8 but in latin1.

What is the motivation for the feature?

I'm trying to write a livecheck block for my formula, using the :page_match strategy (which I use without problems in my other formulae).

Unfortunately, in this case, the upstream web page is very old (2005 😉) and its HTML is not encoded in UTF-8.
The HTML clearly declares iso-8859-1 (latin1) as its encoding in the first line:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">

Then, brew livecheck fails with the following error message:
invalid byte sequence in UTF-8

which is quite true, since the page contains accented French letters in the encoding declared by the source file: iso-8859-1

  • Is there any way to "force" the :page_match strategy to process the content all the same? (my regex targets only simple ASCII characters)
  • Is there any way to recode the content "on the fly" (into UTF-8) before performing the matching?
  • Is this a behaviour that should be changed in the code of brew livecheck, so that it follows the declared encoding?
  • Are there any other ideas to make my regex work on this page not encoded in UTF-8?

Suggestions are welcome! (other than asking the upstream to rewrite their web page 😉)

How will the feature be relevant to at least 90% of Homebrew users?

I'm not sure it is useful for 90% of Homebrew users.
That's why I'm rather looking for a workaround.
Regards.

What alternatives to the feature have been considered?

  • tried to target only ASCII parts of the web page
  • tried to find a way to transcode the web content on-the-fly, before matching
@Yann-R Yann-R added the features New features label Jun 7, 2021
@MikeMcQuaid MikeMcQuaid added the help wanted We want help addressing this label Jun 7, 2021
@MikeMcQuaid
Member

@samford or @nandahkrishna may be able to help here. We'd accept a PR for this.

@samford
Member

samford commented Jun 7, 2021

I would be interested to hear whether this issue still occurs now that #10834 has been merged. Livecheck::Strategy#page_content now uses String#scrub on curl output to help avoid invalid byte sequence in UTF-8 errors. We may be able to handle encoding in a smarter fashion than String#scrub, but this naive approach may work for you in the interim (since you're only interested in ASCII characters).

This change is part of the most recent version of brew that was tagged today (3.1.10), so it would be worth trying this again after running brew update. If it's still an issue with the most recent changes, I can dig into this further.

@Yann-R
Contributor Author

Yann-R commented Jun 7, 2021

@samford Really good news with this merge.
What a coincidence indeed: I opened this discussion today, this merge landed, and my problem of the past weeks is solved! 👍

  • Now my regex targeting only ASCII characters in the web page encoded in latin1 works like a charm.
  • I must admit that I also checked that my regex still fails with accented characters from this web page, but I'll dig into this detail later.

Thank you very much.
Best regards.

@samford
Member

samford commented Jun 7, 2021

my regex still fails with accented characters from this web page

I should clarify that String#scrub simply replaces invalid bytes with a replacement character (I believe \uFFFD is used by default). If I had to guess, the accented characters are probably being replaced and that's why the regex doesn't match non-ASCII characters.
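The replacement behaviour can be demonstrated in plain Ruby (a minimal standalone sketch, not Homebrew's actual code path):

```ruby
# "é" is the single byte 0xE9 in ISO-8859-1, which is an invalid
# byte sequence when the string is treated as UTF-8.
latin1_bytes = "r\xE9pertoire".dup.force_encoding(Encoding::UTF_8)
latin1_bytes.valid_encoding?  # => false

# String#scrub replaces each invalid byte with U+FFFD by default.
scrubbed = latin1_bytes.scrub
scrubbed == "r\uFFFDpertoire"  # => true
```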

Would you be willing to link me to the page you're using (so I can use it for testing)? If the answer's no (completely understandable), I can probably come up with something but it would save me the trouble.

@Yann-R
Contributor Author

Yann-R commented Jun 7, 2021

Oh a single replacement character, I understand.
Of course, you're welcome, here follows such a web page with latin1 encoding for French accented characters: https://matthieu-moy.fr/utils/

@samford
Member

samford commented Jun 8, 2021

This issue is basically what I expected: the page content provided by curl is encoded in ISO-8859-1 but Ruby doesn't know any better and treats it as UTF-8, which causes the invalid byte sequence in UTF-8 error. To properly handle this (not just scrubbing invalid bytes), we would have to force the encoding to ISO-8859-1 and then convert it to UTF-8 (e.g., output = stdout.force_encoding("iso-8859-1").encode("utf-8")). After that, regex(/(répertoire)/i) would work because both the regex and the page content would be UTF-8.
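A minimal sketch of that conversion, using an inline byte string in place of the actual curl output:

```ruby
# Raw bytes as curl would deliver them (0xE9 is "é" in ISO-8859-1).
raw = "r\xE9pertoire".dup.force_encoding(Encoding::ASCII_8BIT)

# Tell Ruby the real encoding, then transcode to UTF-8.
utf8 = raw.force_encoding("iso-8859-1").encode("utf-8")

utf8                          # => "répertoire"
utf8.match?(/(répertoire)/i)  # => true
```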

Usually we could use the content-type header from the response to understand the encoding but that won't work for this particular page. The response for this page has a content-type header with a value of text/html; charset=utf-8, so the server isn't specifying the correct encoding for that page. This is why the page loads in a browser with the wrong encoding and you see a bunch of � characters. [Testing the same HTML content from a local server while setting a content-type header of text/html; charset=iso-8859-1 predictably causes it to load with the correct encoding.]

What makes this tricky is that #force_encoding/#encode would need to be done where we're currently using #scrub, otherwise we'll run into the invalid byte sequence... error in the subsequent while loop where we separate the curl output into the head(s) and body. For example, we can't defer #force_encoding/#encode to a strategy block because we would encounter the invalid byte sequence... error before the point where that block is executed.


Looking to the future, I think we'll have to do two things:

  • Add some code that catches the invalid byte sequence... error and automatically attempts to resolve the encoding issue. We could do this by parsing the scrubbed output to identify the final content-type header and using the charset information to re-encode the original output before parsing it. [It's necessary to search the scrubbed output, otherwise we get the invalid byte sequence... error.]
  • Allow for a configuration option in a livecheck block, where we can specify the encoding, to manually handle the situation where the content-type header is incorrect (wrong charset specified, as here), incomplete (no charset specified), or missing.
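A rough sketch of the first item, to make the idea concrete. The method name and structure are hypothetical illustrations, not actual Homebrew livecheck internals:

```ruby
# Hypothetical helper: sniff the charset from the response headers in the
# scrubbed output, then re-encode the original bytes accordingly.
def re_encode_page_content(raw_output)
  # Search the *scrubbed* output, so this scan itself can't raise the
  # "invalid byte sequence in UTF-8" error.
  scrubbed = raw_output.dup.force_encoding(Encoding::UTF_8).scrub

  # Use the charset from the final Content-Type header, if present.
  charset = scrubbed.scan(/^Content-Type:.*?charset=([\w-]+)/i).flatten.last
  return scrubbed if charset.nil? || charset.casecmp?("utf-8")

  raw_output.dup.force_encoding(charset)
            .encode("utf-8", invalid: :replace, undef: :replace)
rescue ArgumentError # unknown charset name in the header
  scrubbed
end
```

This falls back to the scrubbed output whenever no usable charset is found, which matches the current behaviour.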

It would be good to address this issue at some point but I'm currently busy with other livecheck work and this isn't a pressing issue from the standpoint of homebrew/core and homebrew/cask. We tend to only use ASCII characters in regexes, so the current #scrub approach is fine for us at the moment.

If you're invested in this and want to create a PR, the first item on the list above (automatic re-encoding) is feasible right now but the second item can't be implemented until after I create a PR for my work to allow configuration options in livecheck blocks.

Otherwise, if the current setup is generally fine for your use case, you can technically match words that use accented characters by replacing the accented character with a dot, which matches anything (e.g., /(r.pertoire)/i instead of /(répertoire)/i). It's not a real solution but it would work for now and I'll get around to properly addressing this situation as time permits in the future.
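The workaround can be checked against scrubbed content directly (a small sketch using an inline string rather than real page content):

```ruby
# Scrubbed page content: the "é" byte has become U+FFFD.
scrubbed = "r\xE9pertoire".dup.force_encoding(Encoding::UTF_8).scrub

scrubbed.match?(/(répertoire)/i)  # => false: the "é" is gone
scrubbed.match?(/(r.pertoire)/i)  # => true: "." matches any character
```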

@Yann-R
Contributor Author

Yann-R commented Jun 8, 2021

Hello @samford what a nice investigation!

The response for this page has a content-type header with a value of text/html; charset=utf-8, so the server isn't specifying the correct encoding for that page

Good catch.
From your remark, I dug into the web and discovered that an encoding declaration in the header (as in this case) still requires the charset declaration in the document:

Interesting play with the character encoding 😎

if the current setup is generally fine for your use case, you can technically match words that use accented characters by replacing the accented character with a dot, which matches anything.

Yes. Just for the sake of trying, I already tested this successfully. But as said above, an ASCII-only regex suits me, so your recent merge solved my main problem.
Thanks a lot 👍

@MikeMcQuaid
Member

  • Add some code that catches the invalid byte sequence... error and automatically attempts to resolve the encoding issue. We could do this by parsing the scrubbed output to identify the final content-type header and using the charset information to re-encode the original output before parsing it. [It's necessary to search the scrubbed output, otherwise we get the invalid byte sequence... error.]

This seems like the preferable option to me 👍🏻

@MikeMcQuaid
Member

This seems like a pretty niche issue, a misconfigured server and something that's better suited to a PR for any further discussion. Sorry!

@github-actions github-actions bot added the outdated PR was locked due to age label Feb 14, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2022