Bug 631751 (Closed)
Opened 14 years ago • Closed 14 years ago
A saved HTML page is rendered improperly
Categories: Core :: DOM: HTML Parser (defect)
Status: RESOLVED FIXED
Tracking: blocking2.0 final+
People: Reporter: bobbyalexphilip; Assigned: hsivonen
Keywords: regression
Whiteboard: [softblocker]
Attachments (4 files, 2 obsolete files):
- 49.16 KB, text/html
- 306.00 KB, image/png
- 247.21 KB, image/png
- 9.65 KB, patch (bzbarsky: review+, bzbarsky: approval2.0+)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10
I get a statement from my brokers as an HTML page. In Firefox 4 it is rendered improperly: the alignment is completely broken. I checked the same page in Firefox 3.6 as well as IE8, and it is shown properly there.
Reproducible: Always
Steps to Reproduce:
Just use the attached HTML page
Actual Results:
See screenshot 1.
Expected Results:
See screenshot 2.
Reporter
Comment 1 • 14 years ago
Reporter
Comment 2 • 14 years ago
Reporter
Comment 3 • 14 years ago
Assignee
Comment 4 • 14 years ago
This is very odd. I'll look into this.
Assignee: nobody → hsivonen
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee
Updated • 14 years ago
OS: Windows 7 → All
Hardware: x86 → All
Comment 5 • 14 years ago
The file is UTF-16 encoded, with no BOM. It does have a <meta> saying so, but of course the <meta> prescan won't find that.
Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old parser treated it as UTF-16LE...
blocking2.0: --- → ?
Keywords: regression
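For illustration only, a toy standalone sketch of why the byte-oriented prescan misses such a declaration (this is not the Gecko prescan code; everything here is made up for the example): the ASCII byte sequence for "charset" never appears verbatim in BOM-less UTF-16 content, because every Basic Latin character is interleaved with a 0x00 byte.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
  // "<meta charset=utf-16>" encoded as UTF-16LE (low byte first, then 0x00).
  const char ascii[] = "<meta charset=utf-16>";
  std::vector<unsigned char> utf16le;
  for (const char* p = ascii; *p; ++p) {
    utf16le.push_back(static_cast<unsigned char>(*p));
    utf16le.push_back(0x00);
  }

  // A byte-oriented search for the ASCII bytes "charset" finds nothing,
  // which is why a byte-based <meta> prescan misses the declaration.
  const unsigned char needle[] = {'c', 'h', 'a', 'r', 's', 'e', 't'};
  bool found = std::search(utf16le.begin(), utf16le.end(),
                           needle, needle + sizeof(needle)) != utf16le.end();
  assert(!found);
  return 0;
}
```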
Comment 6 • 14 years ago
Looks like the HTML5 spec's charset detection at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding is just broken for BOM-less UTF-16 without a declared charset. We should use the weasel-room from step 7 to do something sane here.
The old parser had a bunch of code dealing with null bytes at file start, including looking for UCS-4 BOMs, looking for some combination of '<' and '\0' in the first 4 bytes, looking for UTF-16-encoded "<!" and "<?" and "<H" and "<h" in the first 4 bytes, etc.
I think we need something like that here too; certainly to handle this page correctly.
Comment 7 • 14 years ago
Shouldn't be too hard to craft wording for the prescan algorithm that looks for sequences like 00 3C 00 21 00 2D 00 2D, or the same in opposite endianness... http://tools.ietf.org/html/draft-abarth-mime-sniff would need updating too.
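For illustration only, a minimal standalone C++ sketch of matching those byte sequences at the start of a buffer (this is neither the proposed spec wording nor Gecko code; the function and enum names are hypothetical):

```cpp
#include <cstddef>
#include <cstring>

enum class Utf16CommentSniff { None, LittleEndian, BigEndian };

// Check whether the buffer starts with "<!--" encoded as BOM-less UTF-16,
// in either byte order.
Utf16CommentSniff SniffUtf16CommentStart(const unsigned char* buf, size_t len) {
  // "<!--" as UTF-16BE: 00 3C 00 21 00 2D 00 2D
  static const unsigned char kBigEndian[8] = {
      0x00, 0x3C, 0x00, 0x21, 0x00, 0x2D, 0x00, 0x2D};
  // "<!--" as UTF-16LE: 3C 00 21 00 2D 00 2D 00
  static const unsigned char kLittleEndian[8] = {
      0x3C, 0x00, 0x21, 0x00, 0x2D, 0x00, 0x2D, 0x00};
  if (len < 8) {
    return Utf16CommentSniff::None;
  }
  if (std::memcmp(buf, kBigEndian, 8) == 0) {
    return Utf16CommentSniff::BigEndian;
  }
  if (std::memcmp(buf, kLittleEndian, 8) == 0) {
    return Utf16CommentSniff::LittleEndian;
  }
  return Utf16CommentSniff::None;
}
```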
Assignee
Comment 8 • 14 years ago
(In reply to comment #5)
> Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old
> parser treated it as UTF-16LE...
IE9 PP7 treats this as UTF-8, too, but it seems to drop zero bytes before tokenizing, which isn't exactly safe, so let's not go there.
It's interesting to test what happens with BOM-less UTF-16 that contains characters from outside the Basic Latin range. That is, such content is expected to "break" if the test case here "works" only because the zero bytes are dropped.
It turns out that neither IE6 nor IE9 PP7 (didn't test 7 and 8) sniffs UTF-16 in these cases (spot-checked loading from local disk in IE6, too):
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta.htm
Given that these pages don't "work" in IE, Chrome or Safari, I think this bug should be WONTFIXed. bz, is that OK given this evidence?
Comment 9 • 14 years ago
I think that gives a really sucky user experience, though. And the pages do work with Opera and Firefox 3.x. They also "work" in IE for western users....
I really think that UAs should make more of an effort to have this work. There are some basic heuristics we could use (e.g. the one you proposed on irc) that should have a very low false-positive rate and work well on at least the corpus of documents that IE works on.
Assignee
Comment 10 • 14 years ago
Per IRC discussion with bz, I'll fix this for the case that seems to "work" in IE by checking if every second byte is zero in the first 1024 bytes.
Status: NEW → ASSIGNED
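For illustration only, a minimal standalone sketch of that heuristic (not the patch that eventually landed; the names and return type are invented for this example): scan up to the first 1024 bytes and treat the buffer as UTF-16 only if all non-zero bytes fall on a single byte parity.

```cpp
#include <cstddef>

enum class SniffResult { NotUtf16, Utf16Le, Utf16Be };

SniffResult SniffBomlessUtf16(const unsigned char* buf, size_t len) {
  const size_t limit = len < 1024 ? len : 1024;
  bool byteNonZero[2] = {false, false};  // [0] = even offsets, [1] = odd offsets
  for (size_t i = 0; i < limit; ++i) {
    if (buf[i] == 0) {
      continue;
    }
    if (byteNonZero[1 - (i % 2)]) {
      // Non-zero bytes on both parities: not a plausible BOM-less UTF-16
      // prefix of ASCII-range markup.
      return SniffResult::NotUtf16;
    }
    byteNonZero[i % 2] = true;
  }
  if (byteNonZero[0] && !byteNonZero[1]) {
    return SniffResult::Utf16Le;  // non-zero (low) bytes come first
  }
  if (byteNonZero[1] && !byteNonZero[0]) {
    return SniffResult::Utf16Be;  // zero (high) bytes come first
  }
  return SniffResult::NotUtf16;
}
```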
Assignee
Comment 11 • 14 years ago
The test doesn't work. Framing is probably interfering or something.
Comment 12 • 14 years ago
Almost certainly, yes. :(
Comment 13 • 14 years ago
Blocking; let's get this test fixed and the patch landed.
blocking2.0: ? → final+
Whiteboard: [softblocker]
Comment 14 • 14 years ago
Fwiw, a chrome mochitest might work for the test...
Assignee
Comment 15 • 14 years ago
Attachment #511429 - Attachment is obsolete: true
Attachment #512139 - Flags: review?(bzbarsky)
Comment 16 • 14 years ago
Comment on attachment 512139 [details] [diff] [review]
Patch using the chardet harness
>+ if (i & 1) {
>+ if (evenByteNonZero) {
>+ return;
>+ }
>+ oddByteNonZero = PR_TRUE;
>+ } else {
>+ if (oddByteNonZero) {
>+ return;
>+ }
>+ evenByteNonZero = PR_TRUE;
>+ }
This (duplicated) code might be simpler if you declared the booleans as:
// even-numbered bytes tracked at 0, odd-numbered bytes tracked at 1
PRBool byteNonZero[2] = { PR_FALSE, PR_FALSE };
and then you can do
if (byteNonZero[1 - (i % 2)])
return;
byteNonZero[i % 2] = PR_TRUE;
in the loop, and similar for the i+j loop.
I don't see why you need the last check for |oddByteNonZero| before deciding you're LE. How can that be nonzero?
Also, wouldn't it make sense to just do the mCharset set conditionally at the end there, and share the other three lines of code?
> + * Sniff for if every other byte in the sniffing buffer is zero.
"Check whether every other byte...."
r=me with those changes.
Attachment #512139 - Flags: review?(bzbarsky) → review+
Assignee
Comment 17 • 14 years ago
Oops, can't land this: bad behavior with files that are one byte long.
Assignee
Comment 18 • 14 years ago
The patch requires at least 30 bytes of data (the length of "<title></title>" as UTF-16) in order to sniff as UTF-16. This avoids false positives with very short files.
Attachment #512139 - Attachment is obsolete: true
Attachment #512532 - Flags: review?(bzbarsky)
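For illustration only, a hypothetical sketch of that minimum-length guard (the constant and function names are made up; 30 bytes is the length of "<title></title>" in UTF-16, i.e. 15 characters at 2 bytes each):

```cpp
#include <cstddef>

// Smallest plausible UTF-16 HTML document: "<title></title>" is 15
// characters, or 30 bytes once encoded as UTF-16.
const size_t kMinBytesForUtf16Sniffing = 30;

bool ShouldAttemptUtf16Sniffing(size_t availableBytes) {
  // Refuse to sniff very short buffers to avoid false positives.
  return availableBytes >= kMinBytesForUtf16Sniffing;
}
```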
Comment 19 • 14 years ago
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes
r=me
Attachment #512532 - Flags: review?(bzbarsky) → review+
Assignee
Comment 20 • 14 years ago
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes
Sounds like softblockers need explicit approval now.
Attachment #512532 - Flags: approval2.0?
Comment 21 • 14 years ago
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes
a=me
Attachment #512532 - Flags: approval2.0? → approval2.0+
Assignee
Comment 22 • 14 years ago
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED