Closed Bug 631751 Opened 13 years ago Closed 13 years ago

A saved HTML page is rendered improperly

Categories

(Core :: DOM: HTML Parser, defect)

Tracking

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: bobbyalexphilip, Assigned: hsivonen)


Details

(Keywords: regression, Whiteboard: [softblocker])

Attachments

(4 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10

I get a statement from my brokers as an HTML page. In Firefox 4 it is rendered improperly: the alignment is completely broken. I checked the same page in Firefox 3.6 as well as IE8, and it is shown properly there.

Reproducible: Always

Steps to Reproduce:
Just use the attached HTML page
Actual Results:  
See screenshot 1.

Expected Results:  
See screenshot 2.
Attached image improper rendering
Attached image Desired rendering
This is very odd. I'll look into this.
Assignee: nobody → hsivonen
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 7 → All
Hardware: x86 → All
The file is UTF-16 encoded, with no BOM.  It does have a <meta> saying so, but of course the <meta> prescan won't find that.

Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old parser treated it as UTF-16LE...
blocking2.0: --- → ?
Keywords: regression
Looks like the HTML5 spec's charset detection at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding is just broken for BOM-less UTF-16 without a declared charset.  We should use the weasel-room from step 7 to do something sane here.

The old parser had a bunch of code dealing with null bytes at file start, including looking for UCS-4 BOMs, looking for some combination of '<' and '\0' in the first 4 bytes, looking for UTF-16-encoded "<!" and "<?" and "<H" and "<h" in the first 4 bytes, etc.

I think we need something like that here too; certainly to handle this page correctly.
Shouldn't be too hard to craft wording for the prescan algorithm that looks for sequences like 00 3C 00 21 00 2D 00 2D, or the same in opposite endianness... http://tools.ietf.org/html/draft-abarth-mime-sniff would need updating too.
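For illustration, the 00 3C 00 21 sequence above is "<!" encoded as UTF-16BE, and the byte-swapped form is UTF-16LE. A minimal standalone sketch of such a first-bytes check (hypothetical code, not the actual Gecko prescan implementation; the function name and enum are made up here) might look like:

```cpp
#include <cstddef>
#include <cstdint>

enum class Utf16Guess { None, LittleEndian, BigEndian };

// Check whether the first four bytes look like a BOM-less UTF-16 "<!"
// comment opener. 00 3C 00 21 is "<!" as UTF-16BE (zero high byte first);
// 3C 00 21 00 is the same two characters as UTF-16LE.
Utf16Guess SniffCommentOpener(const uint8_t* buf, size_t len) {
  if (len < 4) {
    return Utf16Guess::None;
  }
  if (buf[0] == 0x00 && buf[1] == 0x3C && buf[2] == 0x00 && buf[3] == 0x21) {
    return Utf16Guess::BigEndian;
  }
  if (buf[0] == 0x3C && buf[1] == 0x00 && buf[2] == 0x21 && buf[3] == 0x00) {
    return Utf16Guess::LittleEndian;
  }
  return Utf16Guess::None;
}
```

A real prescan would also have to cover the other patterns the old parser handled ("<?", "<H", "<h", and so on), so this only sketches the shape of the check.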
(In reply to comment #5)
> Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old
> parser treated it as UTF-16LE...

IE9 PP7 treats this as UTF-8, too, but it seems to drop zero bytes before tokenizing, which isn't exactly safe, so let's not go there.

It's worth testing what happens with BOM-less UTF-16 that contains characters outside the Basic Latin range: if the test case here "works" only because zero bytes are dropped, such content should visibly break.

It turns out that neither IE6 nor IE9 PP7 (didn't test 7 and 8) sniffs UTF-16 in these cases (spot-checked loading from local disk in IE6, too):
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta.htm

Given that these pages don't "work" in IE, Chrome or Safari, I think this bug should be WONTFIXed. bz, is that OK given this evidence?
I think that gives a really sucky user experience, though.  And the pages do work with Opera and Firefox 3.x.  They also "work" in IE for western users....

I really think that UAs should make more of an effort to have this work.  There are some basic heuristics we could use (e.g. the one you proposed on irc) that should have a very low false-positive rate and work well on at least the corpus of documents that IE works on.
Per IRC discussion with bz, I'll fix this for the case that seems to "work" in IE by checking if every second byte is zero in the first 1024 bytes.
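The heuristic described here — sniff as UTF-16 when every second byte in the first 1024 bytes is zero — can be sketched as standalone code. This is a hypothetical illustration, not the actual patch; it folds in the 30-byte minimum ("<title></title>" as UTF-16) that the final patch added to avoid false positives on tiny files, and uses the even/odd-tracking array idiom suggested in review:

```cpp
#include <cstddef>
#include <cstdint>

enum class Utf16Guess { None, LittleEndian, BigEndian };

// Hypothetical sketch of the sniffing heuristic discussed in this bug.
// Requires at least 30 bytes so that very short files cannot trigger a
// false positive.
Utf16Guess GuessUtf16(const uint8_t* buf, size_t len) {
  if (len < 30) {
    return Utf16Guess::None;
  }
  const size_t limit = len < 1024 ? len : 1024;
  // byteNonZero[0] tracks even offsets, byteNonZero[1] tracks odd offsets.
  bool byteNonZero[2] = { false, false };
  for (size_t i = 0; i < limit; ++i) {
    if (buf[i] != 0) {
      // Nonzero bytes at both parities: not BOM-less UTF-16 text that
      // stays within the Basic Latin range.
      if (byteNonZero[1 - (i % 2)]) {
        return Utf16Guess::None;
      }
      byteNonZero[i % 2] = true;
    }
  }
  if (byteNonZero[0]) {
    return Utf16Guess::LittleEndian;  // ASCII in the low (even) bytes
  }
  if (byteNonZero[1]) {
    return Utf16Guess::BigEndian;     // high byte first, ASCII at odd offsets
  }
  return Utf16Guess::None;            // all zeros: no evidence either way
}
```

On the attached testcase (BOM-less UTF-16LE with an ASCII-only prefix), every odd-offset byte in the first kilobyte is zero, so this would guess UTF-16LE.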
Status: NEW → ASSIGNED
Attached patch Patch with broken test (obsolete) — Splinter Review
The test doesn't work. Framing probably interfering or something.
Almost certainly, yes.  :(
Blocking, let's get this test fixed and the patch landed.
blocking2.0: ? → final+
Whiteboard: [softblocker]
Fwiw, a chrome mochitest might work for the test...
Attached patch Patch using the chardet harness (obsolete) — Splinter Review
Attachment #511429 - Attachment is obsolete: true
Attachment #512139 - Flags: review?(bzbarsky)
Comment on attachment 512139 [details] [diff] [review]
Patch using the chardet harness

>+        if (i & 1) {
>+          if (evenByteNonZero) {
>+            return;
>+          }
>+          oddByteNonZero = PR_TRUE;
>+        } else {
>+          if (oddByteNonZero) {
>+            return;
>+          }
>+          evenByteNonZero = PR_TRUE;
>+        }

This (duplicated) code might be simpler if you declared the booleans as:

  // even-numbered bytes tracked at 0, odd-numbered bytes tracked at 1
  PRBool byteNonZero[2] = { PR_FALSE, PR_FALSE };

and then you can do

  if (byteNonZero[1 - (i % 2)])
    return;
  byteNonZero[i % 2] = PR_TRUE;

in the loop, and similar for the i+j loop.

I don't see why you need the last check for |oddByteNonZero| before deciding you're LE.  How can that be nonzero?

Also, wouldn't it make sense to just do the mCharset set conditionally at the end there, and share the other three lines of code?

> +     * Sniff for if every other byte in the sniffing buffer is zero.

"Check whether every other byte...."

r=me with those changes.
Attachment #512139 - Flags: review?(bzbarsky) → review+
Oops. Can't land this. It misbehaves on files that are only one byte long.
The patch requires at least 30 bytes of data (the length of "<title></title>" as UTF-16) in order to sniff as UTF-16. This avoids false positives with very short files.
Attachment #512139 - Attachment is obsolete: true
Attachment #512532 - Flags: review?(bzbarsky)
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

r=me
Attachment #512532 - Flags: review?(bzbarsky) → review+
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

Sounds like softblockers need explicit approval now.
Attachment #512532 - Flags: approval2.0?
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

a=me
Attachment #512532 - Flags: approval2.0? → approval2.0+
Thanks.
http://hg.mozilla.org/mozilla-central/rev/8eb1b3531dd9
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Depends on: 638318
See Also: → 1625258
See Also: → 1727491