
Help to handle brew livecheck from a non-UTF8 web page #11498

Closed
Yann-R opened this issue Jun 7, 2021 · 9 comments
Labels
features New features help wanted We want help addressing this outdated PR was locked due to age

Comments

@Yann-R
Contributor

Yann-R commented Jun 7, 2021

Provide a detailed description of the proposed feature

I'm not really proposing a feature or an evolution of brew livecheck: I'm rather looking for a workaround.

It's about processing (with a regex) a web page encoded not in UTF-8 but in latin1.

What is the motivation for the feature?

I'm trying to write a livecheck block for my formula, using the :page_match strategy (which I use without problems in my other formulae).

Unfortunately, in this case, the upstream web page is very old (2005 😉) and its HTML is not encoded in UTF-8.
The HTML clearly declares iso-8859-1 (latin1) as its encoding in the first line:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">

Then, brew livecheck fails with the following error message:
invalid byte sequence in UTF-8

which is quite true, since the page contains accented French letters in the encoding declared by the source file: iso-8859-1

  • Is there any way to "force" the :page_match strategy to process the content all the same? (my regex targets only simple ASCII characters)
  • Is there any way to recode the content "on the fly" (into UTF-8) before performing the matching?
  • Is this a behaviour that should be changed in the code of brew livecheck, so that it follows the declared encoding?
  • Are there any other ideas to make my regex work on this page not encoded in UTF-8?

Suggestions are welcome! (other than asking the upstream to rewrite their web page 😉)

How will the feature be relevant to at least 90% of Homebrew users?

I'm not sure it is useful for 90% of Homebrew users.
That's why I'm rather looking for a workaround.
Regards.

What alternatives to the feature have been considered?

  • tried to target only ASCII parts of the web page
  • tried to find a way to transcode the web content on-the-fly, before matching
@Yann-R Yann-R added the features New features label Jun 7, 2021
@MikeMcQuaid MikeMcQuaid added the help wanted We want help addressing this label Jun 7, 2021
@MikeMcQuaid
Member

@samford or @nandahkrishna may be able to help here. We'd accept a PR for this.

@samford
Member

samford commented Jun 7, 2021

I would be interested to hear whether this issue still occurs now that #10834 has been merged. Livecheck::Strategy#page_content now uses String#scrub on curl output to help avoid invalid byte sequence in UTF-8 errors. We may be able to handle encoding in a smarter fashion than String#scrub, but this naive approach may work for you in the interim (since you're only interested in ASCII characters).

This change is part of the most recent version of brew that was tagged today (3.1.10), so it would be worth trying this again after running brew update. If it's still an issue with the most recent changes, I can dig into this further.

@Yann-R
Contributor Author

Yann-R commented Jun 7, 2021

@samford Really good news with this merge.
What a coincidence indeed: I opened this discussion today, this merge landed, and my problem of the past weeks is solved! 👍

  • Now my regex targeting only ASCII characters in the web page encoded in latin1 works like a charm.
  • I must admit that I also checked that my regex still fails with accented characters from this web page, but I'll dig into this detail later.

Thank you very much.
Best regards.

@samford
Member

samford commented Jun 7, 2021

my regex still fails with accented characters from this web page

I should clarify that String#scrub simply replaces invalid bytes with a replacement character (I believe \uFFFD is used by default). If I had to guess, the accented characters are probably being replaced and that's why the regex doesn't match non-ASCII characters.
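The replacement behaviour can be demonstrated in plain Ruby (a minimal standalone sketch, not Homebrew's actual code path):

```ruby
# "é" is the single byte 0xE9 in ISO-8859-1, which is an invalid
# byte sequence when the string is treated as UTF-8.
latin1_bytes = "r\xE9pertoire".dup.force_encoding(Encoding::UTF_8)
latin1_bytes.valid_encoding?  # => false

# String#scrub replaces each invalid byte with U+FFFD by default.
scrubbed = latin1_bytes.scrub
scrubbed == "r\uFFFDpertoire"  # => true
```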

Would you be willing to link me to the page you're using (so I can use it for testing)? If the answer's no (completely understandable), I can probably come up with something but it would save me the trouble.

@Yann-R
Contributor Author

Yann-R commented Jun 7, 2021

Oh a single replacement character, I understand.
Of course, you're welcome, here follows such a web page with latin1 encoding for French accented characters: https://matthieu-moy.fr/utils/

@samford
Member

samford commented Jun 8, 2021

This issue is basically what I expected: the page content provided by curl is encoded in ISO-8859-1 but Ruby doesn't know any better and treats it as UTF-8, which causes the invalid byte sequence in UTF-8 error. To properly handle this (not just scrubbing invalid bytes), we would have to force the encoding to ISO-8859-1 and then convert it to UTF-8 (e.g., output = stdout.force_encoding("iso-8859-1").encode("utf-8")). After that, regex(/(répertoire)/i) would work because both the regex and the page content would be UTF-8.
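A minimal sketch of that conversion, using an inline byte string in place of the actual curl output:

```ruby
# Raw bytes as curl would deliver them (0xE9 is "é" in ISO-8859-1).
raw = "r\xE9pertoire".dup.force_encoding(Encoding::ASCII_8BIT)

# Tell Ruby the real encoding, then transcode to UTF-8.
utf8 = raw.force_encoding("iso-8859-1").encode("utf-8")

utf8                          # => "répertoire"
utf8.match?(/(répertoire)/i)  # => true
```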

Usually we could use the content-type header from the response to understand the encoding but that won't work for this particular page. The response for this page has a content-type header with a value of text/html; charset=utf-8, so the server isn't specifying the correct encoding for that page. This is why the page loads in a browser with the wrong encoding and you see a bunch of � characters. [Testing the same HTML content from a local server while setting a content-type header of text/html; charset=iso-8859-1 predictably causes it to load with the correct encoding.]

What makes this tricky is that #force_encoding/#encode would need to be done where we're currently using #scrub, otherwise we'll run into the invalid byte sequence... error in the subsequent while loop where we separate the curl output into the head(s) and body. For example, we can't defer #force_encoding/#encode to a strategy block because we would encounter the invalid byte sequence... error before the point where that block is executed.


Looking to the future, I think we'll have to do two things:

  • Add some code that catches the invalid byte sequence... error and automatically attempts to resolve the encoding issue. We could do this by parsing the scrubbed output to identify the final content-type header and using the charset information to re-encode the original output before parsing it. [It's necessary to search the scrubbed output, otherwise we get the invalid byte sequence... error.]
  • Allow for a configuration option in a livecheck block, where we can specify the encoding, to manually handle the situation where the content-type header is incorrect (wrong charset specified, as here), incomplete (no charset specified), or missing.
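A rough sketch of the first item, to make the idea concrete. The method name and structure are hypothetical illustrations, not actual Homebrew livecheck internals:

```ruby
# Hypothetical helper: sniff the charset from the response headers in the
# scrubbed output, then re-encode the original bytes accordingly.
def re_encode_page_content(raw_output)
  # Search the *scrubbed* output, so this scan itself can't raise the
  # "invalid byte sequence in UTF-8" error.
  scrubbed = raw_output.dup.force_encoding(Encoding::UTF_8).scrub

  # Use the charset from the final Content-Type header, if present.
  charset = scrubbed.scan(/^Content-Type:.*?charset=([\w-]+)/i).flatten.last
  return scrubbed if charset.nil? || charset.casecmp?("utf-8")

  raw_output.dup.force_encoding(charset)
            .encode("utf-8", invalid: :replace, undef: :replace)
rescue ArgumentError # unknown charset name in the header
  scrubbed
end
```

This falls back to the scrubbed output whenever no usable charset is found, which matches the current behaviour.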

It would be good to address this issue at some point but I'm currently busy with other livecheck work and this isn't a pressing issue from the standpoint of homebrew/core and homebrew/cask. We tend to only use ASCII characters in regexes, so the current #scrub approach is fine for us at the moment.

If you're invested in this and want to create a PR, the first item on the list above (automatic re-encoding) is feasible right now but the second item can't be implemented until after I create a PR for my work to allow configuration options in livecheck blocks.

Otherwise, if the current setup is generally fine for your use case, you can technically match words that use accented characters by replacing the accented character with a dot, which matches anything (e.g., /(r.pertoire)/i instead of /(répertoire)/i). It's not a real solution but it would work for now and I'll get around to properly addressing this situation as time permits in the future.
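The workaround can be checked against scrubbed content directly (a small sketch using an inline string rather than real page content):

```ruby
# Scrubbed page content: the "é" byte has become U+FFFD.
scrubbed = "r\xE9pertoire".dup.force_encoding(Encoding::UTF_8).scrub

scrubbed.match?(/(répertoire)/i)  # => false: the "é" is gone
scrubbed.match?(/(r.pertoire)/i)  # => true: "." matches any character
```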

@Yann-R
Contributor Author

Yann-R commented Jun 8, 2021

Hello @samford what a nice investigation!

The response for this page has a content-type header with a value of text/html; charset=utf-8, so the server isn't specifying the correct encoding for that page

Good catch.
From your remark, I dug into the web and discovered that an encoding declaration in the header (as in this case) still requires the charset declaration in the document:

Interesting play with the character encoding 😎

if the current setup is generally fine for your use case, you can technically match words that use accented characters by replacing the accented character with a dot, which matches anything.

Yes. Just for the sake of trying, I already tested this successfully. But as said above, an ASCII-only regex suits me, so your recent merge solved my main problem.
Thanks a lot 👍

@MikeMcQuaid
Member

  • Add some code that catches the invalid byte sequence... error and automatically attempts to resolve the encoding issue. We could do this by parsing the scrubbed output to identify the final content-type header and using the charset information to re-encode the original output before parsing it. [It's necessary to search the scrubbed output, otherwise we get the invalid byte sequence... error.]

This seems like the preferable option to me 👍🏻

@MikeMcQuaid
Member

This seems like a pretty niche issue, a misconfigured server and something that's better suited to a PR for any further discussion. Sorry!

@github-actions github-actions bot added the outdated PR was locked due to age label Feb 14, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2022