From: Zag Zig [ZagZig@BIGFOOT.COM]
Sent: Friday, October 06, 2000 5:50 PM
To: BUGTRAQ@SECURITYFOCUS.COM
Subject: Cross site scripting: a long term fix

Cross site scripting: a long term fix

I recently came across this Bugtraq thread sparked by
the CERT warning about cross site scripting vulnerability
in processing dynamically generated HTML that echoes the
text entered by potentially malicious visitors to the site.
http://www.securityportal.com/list-archive/bugtraq/2000/Feb/0078.html

It appears to me that there is a relatively painless long term
way to fix this problem that I have not seen discussed.
If it was mentioned elsewhere, I would like to know about it.
Here is my idea about how this should be fixed in the long run.
I hope somebody will try to shoot it down.

I distinguish between short term and long term fixes.
The short term fixes are relatively painful to use and will
be used only by the most alert and diligent web designers.
These fixes do not require any changes in browser and server software.
The long term fixes require changes in browser or server software.
They should be fairly painless for the web content providers.
I start with a review of the relevant material I found on the web.

1.1. Press

An interesting commentary on this issue can be found
in 'The Cross-Site Scripting Scam' by John C Dvorak.
http://www.zdnet.com/pcmag/stories/opinions/0,7802,2434175,00.html
This page has a list of links to comments entered by the readers.
It appears that one of the commenting readers successfully illustrated
the problem on that page.

1.2. CERT

The CERT Advisory CA-2000-02 identifies the problem all right.
http://www.cert.org/advisories/CA-2000-02.html
It also proposes a short term fix that web designers can use
right away without any changes to browser and server software.
http://www.cert.org/tech_tips/malicious_code_mitigation.html

This short term fix is complex and not likely to be widely used.
The report does not propose any other changes in the web architecture
that could lead to a simpler, more secure, and more widely used solution.
It does not properly characterize the problem.
It does not examine which features of the web architecture
are responsible for the existence of the problem.
It makes the problem look way too complex.
This is a very simple problem.
The report does not expose the simplicity of the problem
and does not propose a solution of matching simplicity.

1.3. W3C

W3C has the CERT report posted on their web site,
but I could find no other information about this problem.
http://lists.w3.org/Archives/Public/w3c-wai-ig/2000JanMar/0302.html

1.4. Microsoft

Microsoft explains why this problem cannot be fixed in
the web browser software or in the web server software.
Designers of web pages with dynamic content must be
aware of this problem and do something to avoid it.
Although this is correct, it does not mean that browsers
and servers could not give web designers better tools
and procedures for avoiding this problem.
http://www.microsoft.com/technet/security/crsstFAQ.asp

Both CERT and Microsoft suggest that the only solution is to filter
the dynamically generated portion of HTML on input and/or
on output. This is probably the only solution with the
current state of the browser software and the current
HTML standard. Microsoft suggests filtering the following
special characters: ' < >  ) (  & + % ; "
http://www.microsoft.com/technet/security/CSOverv.asp

I am all too familiar with this solution in web forums.
How did I discover it? By posting plain text with one of those
innocent looking characters, then finding that those characters
were missing in the formatted text, often together with other
text that followed them. It happened to me often when posting
a long URL that uses percent sign followed by the numeric value
of a character.
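
To make the damage concrete, here is a minimal Python sketch of
such a filter. The character list is the Microsoft one quoted above.
The function name filter_special and the example.com URL are made up
for illustration, and real forum software may filter differently.

```python
# Characters from the Microsoft list quoted above.
SPECIAL = set("'<>()&+%;\"")


def filter_special(text: str) -> str:
    """Drop every character that appears in the filter list."""
    return "".join(ch for ch in text if ch not in SPECIAL)


# A harmless URL with a percent-encoded space and a second parameter.
url = "http://example.com/search?q=a%20b&lang=en"
print(filter_special(url))
# prints: http://example.com/search?q=a20blang=en
```

The percent sign and the ampersand are gone, so the link
no longer points where it did.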

Applications that expect or require HTML input, such as
web forums, should be aware of HTML security problems.
Even for them, character filtering is not a good solution.
Most web programmers do not expect to find HTML or a script in
simple text input fields and they should not be asked to check for it.
Trying to solve this problem by filtering 'special characters'
on input or output is not the right way to do it.
I do not see anything special about any of those characters.
This will make the web more complex, not more reliable.

1.5. Wrox

Another solution is presented in a Wrox article by James Brannan:
Protecting Yourself, Your Site, and Your Clients from Cross-Site Scripting Attacks.
http://www.asptoday.com/articles/20000525.htm
This applies to Active Server Pages server side scripting.
He suggests escaping dynamic HTML before sending it to a browser,
using the HTMLEncode method on the server.
This effectively quotes the HTML tags, resulting in
the markup being displayed, not acted upon.
He also writes about input and output character filtering,
but I think this is not needed when HTMLEncode is used.
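
The same idea can be sketched outside ASP. The example below uses
html.escape from the Python standard library as a stand-in for
HTMLEncode. It is an analogue, not the ASP method itself, and the
function name encode_for_output is made up for illustration.

```python
import html


def encode_for_output(user_text: str) -> str:
    """Escape markup characters so a browser displays them literally."""
    # quote=True also escapes double and single quotes.
    return html.escape(user_text, quote=True)


print(encode_for_output('<script>alert("hi")</script>'))
# prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

Nothing is thrown away here. The browser shows the original text,
and html.unescape recovers it exactly.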

1.6. Proposal to add a safe quoting tag to HTML

The HTMLEncode solution above is better than filtering.
I propose that a solution for quoting markup should be built into
the HTML specification and therefore made available to all servers
for use with both static and dynamically generated text.
The cross site scripting problem is difficult only as long as
HTML writers do not have a simple and reliable tool to prevent it.
That tool is missing in HTML along with a basic concept.
There is no way to safely quote text containing markup.
Markup is interpreted even inside the <pre> </pre>
'pre formatted' text block. A simple solution
for this problem is to add a new HTML tag
which will process all characters literally,
for example:

<text>
<script> ... </script>
</text>.

Then the server simply wraps the user input
with this tag and makes any scripts harmless.
If you want to publish HTML source as plain text,
you can simply wrap it with this tag.
This is a simplification. I will discuss the safety issues
and the required syntax for this tag later in section 2.

This tag should have been part of HTML from day one.
I take this back, make it day zero.
This tag, when applied to any text, returns that text unchanged.
Zero, when added to any number, returns that number unchanged.
In spite of this simplicity, it took a long time to discover
or invent the number zero. Solving the cross site scripting problem with HTML
lacking this zero tag is like multiplying with Roman numerals.

Will adding this tag cause any problems?
The possible problem is that it may delay some
sexier features: adding smell, taste, touch,
the sixth sense and the fourth dimension to the web.
This is a no-op tag, it performs no operation.
It should not be too difficult to implement it.
It would be difficult to make incompatible implementations,
but not impossible.

1.7. Can HTML quoting be made safe?

If this were a talk, I would have expected someone to interrupt me
a few paragraphs earlier where I suggested the simplified syntax.

Surely you are joking. This can be defeated easily with this input:

</text>
<script> ... </script>
<text>

Together they give:

<text>
</text>
<script> ... </script>
<text>
</text>

This would be valid HTML and would introduce a script.
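
The attack can be reproduced with a toy model in Python. This is only
a sketch: the <text> tag does not exist in any browser, and live_markup
is a hypothetical stand-in for a browser that treats everything from
<text> up to the first </text> as literal text.

```python
import re


def naive_wrap(user_text: str) -> str:
    """Wrap user input in the simplified quoting tag."""
    return "<text>\n" + user_text + "\n</text>"


def live_markup(page: str) -> str:
    """Return what a toy browser would still interpret as markup.

    Each <text> block is closed by the first </text> that follows it,
    and the quoted block is removed from the result.
    """
    return re.sub(r"<text>.*?</text>", "", page, flags=re.DOTALL)


attack = "</text>\n<script> ... </script>\n<text>"
page = naive_wrap(attack)
print("<script>" in live_markup(page))
# prints: True
```

With harmless input such as "<script>x</script>" the same check
prints False, because the whole script sits inside the quoted block.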

2. Syntax required to make HTML quoting safe.

Making quoting safe is not difficult.
To make quoting safe we need to add
some attributes to the quoting tag.
Our no-op program needs some parameters.
It may even have to do some work.
Recent programming languages have introduced
many new ways to quote text strings.
The two that I would use here are not so recent.

2.1. Adding an end marker to the opening and closing tag.

<text end='unique string'> ... </text end='unique string'>

The opening and the closing tag must be identified
by the same string. The program sending this text
to the browser could use the current time
to form this id. I would like this syntax
for hand coded HTML. Similar syntax for quoting
is available in many programming languages and
dates back to at least the original Unix shell.
It is known by a name that I am afraid to repeat here,
for fear of offending a grammar checker.
The name is 'here document'.
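
A minimal Python sketch of the end marker scheme follows. The names
wrap_with_marker and extract_quoted are hypothetical, and the parser
is a toy, not a browser. One swap is made on purpose: instead of the
current time, the sketch draws a random marker and retries until the
marker does not occur in the input, on the assumption that a sender
of hostile text might be able to guess a timestamp.

```python
import re
import secrets

# Matches the proposed opening tag and captures its marker.
OPENING = re.compile(r"<text end='([^']*)'>")


def wrap_with_marker(user_text: str) -> str:
    """Quote user_text with a marker that does not occur inside it."""
    while True:
        marker = secrets.token_hex(16)
        if marker not in user_text:
            break
    return f"<text end='{marker}'>{user_text}</text end='{marker}'>"


def extract_quoted(page: str):
    """Return the literal text of the first quoted block, or None."""
    match = OPENING.search(page)
    if match is None:
        return None
    closing = f"</text end='{match.group(1)}'>"
    stop = page.find(closing, match.end())
    if stop == -1:
        return None
    return page[match.end():stop]


attack = "</text>\n<script> ... </script>\n<text>"
page = wrap_with_marker(attack)
print(extract_quoted(page) == attack)
# prints: True
```

The injected </text> carries no marker, so it does not close
the block and the script stays quoted.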

2.2. Adding the count of bytes in the text.

<text bytes='3'>ABC</text bytes='3'>
<text bytes='3'>ABC</text>

This works even better when tags are generated by
a program. Counting bytes is a cheap operation.
This type of a quoted string is older than Fortran.
Fortran borrowed it from the punched cards
as the name 'Hollerith constant' suggests.

What should the browser do if the byte count in the tag
does not match the number of bytes actually received
before the closing tag?
It should throw away the string and replace it
with a string of length zero.
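
The counted form can be sketched in Python as well. The names
wrap_with_count and read_counted are hypothetical, the sketch uses
the second form shown above (count in the opening tag only), and
counting UTF-8 bytes is an assumption, since the encoding is not
specified here.

```python
import re

# Matches the proposed opening tag and captures the byte count.
OPENING = re.compile(rb"<text bytes='(\d+)'>")
CLOSING = b"</text>"


def wrap_with_count(user_text: str) -> bytes:
    """Quote user_text, declaring its length in bytes up front."""
    data = user_text.encode("utf-8")
    return b"<text bytes='%d'>" % len(data) + data + CLOSING


def read_counted(page: bytes) -> bytes:
    """Return the quoted bytes, or an empty string on a count mismatch."""
    match = OPENING.search(page)
    if match is None:
        return b""
    start = match.end()
    end = start + int(match.group(1))
    # The closing tag must sit exactly where the count says it should.
    if page[end:end + len(CLOSING)] != CLOSING:
        return b""
    return page[start:end]


attack = "</text>\n<script> ... </script>\n<text>"
print(read_counted(wrap_with_count(attack)) == attack.encode("utf-8"))
# prints: True
print(read_counted(b"<text bytes='5'>ABC</text>"))
# prints: b''
```

The reader never searches the quoted text for a closing tag, so an
injected </text> is just more bytes. A wrong count yields the string
of length zero.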

I also considered allowing the count to be
deferred to the closing tag for long strings.

<text bytes=''>
...
</text bytes='1001'>

This is easily defeated by the following input.

</text bytes='0'>
<script> ... </script>
<text bytes=''>

Together they generate the following HTML which is mostly valid.

<text bytes=''>
</text bytes='0'>
<script> ... </script>
<text bytes=''>
</text bytes='1001'>

The browser will reject the second text block, because it is empty
while its closing tag claims 1001 bytes, but by then the script
has already been introduced.

2.3. Additional functionality

So far I have described the basic functionality
needed to fix the cross site scripting problem.
This tag is also useful as an alternative to <pre>
for static text containing HTML tags meant to be viewed,
not interpreted. To make it even more useful we could
add attributes ON and OFF to list the tags that
must or must not be interpreted within this block.

Often links are the only reason I want to use HTML
instead of plain text. This would give me plain text
with anchors:

<text end='end-of-text' on='a'>
</text end='end-of-text'>

Perhaps this should be controlled from style sheets
linking to the id or class attribute of this tag.

<text end='end-of-text' class='text-with-links'>
</text end='end-of-text'>

###