
CERT® Coordination Center

Understanding Malicious Content Mitigation for Web Developers

CERT Advisory CA-2000-02 describes a problem with malicious tags embedded in client HTTP requests, discusses the impact of malicious scripts, and offers ways to prevent the insertion of malicious tags.

This tech tip, written for web developers, describes more specifically the steps you can take to prevent attackers from using untrusted content to exploit your web site.

This document has the following sections:

Problem Summary
Mitigation Summary
Explicitly Setting the Character Encoding
Identifying the Special Characters
Encoding Dynamic Output Elements
Filtering Dynamic Content
Examine Cookies
Sample Filtering Code
ISO-8859-1 (Latin-1) Character Set

Problem Summary

Web pages contain both text and HTML markup that is generated by the server and interpreted by the client browser. Servers that generate static pages have full control over how the client will interpret the pages sent by the server. However, servers that generate dynamic pages do not have complete control over how their output is interpreted by the client. The heart of the issue is that if untrusted content can be introduced into a dynamic page, neither the server nor the client has enough information to recognize that this has happened and take protective actions.

In HTML, to distinguish text from markup, some characters are treated specially. The grammar of HTML determines the significance of "special" characters -- different characters are special at different points in the document. For example, the less-than sign "<" typically indicates the beginning of an HTML tag. Tags can either affect the formatting of the page or introduce a program that the browser executes (e.g., the <SCRIPT> tag introduces code from a variety of scripting languages).

Many web servers generate web pages dynamically. For example, a search engine may perform a database search and then construct a web page that contains the result of the search. Any server that creates web pages by inserting dynamic data into a template should check to make sure that the data to be inserted does not contain any special characters (e.g., "<"). If the inserted data contains special characters, the user's web browser will mistake them for HTML markup. Because HTML markup can introduce programs, the browser could interpret some data values as HTML tags or script rather than displaying them as text.

The risk of a web server not doing a check for special characters in dynamically generated web pages is that in some cases an attacker can choose the data that the web server inserts into the generated page. Then the attacker can trick the user's browser into running a program of the attacker's choice. This program will execute in the browser's security context for communicating with the legitimate web server, not the browser's security context for communicating with the attacker. Thus, the program will execute in an inappropriate security context with inappropriate privileges.
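
To make the problem concrete, the following JavaScript sketch shows a page template that inserts a search query into its output without any checking. The function name renderResults and the sample queries are assumptions made for illustration; they are not taken from any particular server.

```javascript
// Hypothetical page template: inserts the search query into HTML verbatim.
function renderResults(query) {
    return "<P>Results for: " + query + "</P>";
}

// A benign query is displayed as text.
var benign = renderResults("security tips");

// An attacker-chosen query carries markup. The browser receiving this page
// cannot tell that the SCRIPT element was not written by the site itself.
var hostile = renderResults("<SCRIPT>alert(document.cookie)</SCRIPT>");

console.log(benign);
console.log(hostile);
```

In the second case the script would run with the privileges the browser grants to the legitimate site, which is the inappropriate security context described above.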

Mitigation Summary

Any data inserted into an output stream originating from a server is presented as originating from that server, even if it does not include malicious tags. Web developers must evaluate whether their sites will send untrusted data as part of an output stream.

Untrusted input can come from, but is not limited to,

URL parameters
Form elements
Cookies
Database queries

A combination of steps must be taken to mitigate this vulnerability. These steps include

Explicitly setting the character set encoding for each page generated by the web server
Identifying special characters
Encoding dynamic output elements
Filtering specific characters in dynamic elements
Examining cookies

The following sections discuss details of each of these steps.

Explicitly Setting the Character Encoding

Many web pages leave the character encoding ("charset" parameter in HTTP) undefined. In earlier versions of HTML and HTTP, the character encoding was supposed to default to ISO-8859-1 if it wasn't defined. In fact, many browsers had a different default, so it was not possible to rely on the default being ISO-8859-1. HTML version 4 legitimizes this - if the character encoding isn't specified, any character encoding can be used.

If the web server doesn't specify which character encoding is in use, it can't tell which characters are special. Web pages with unspecified character encoding work most of the time because most character sets assign the same characters to byte values below 128. But which of the values above 128 are special? Some 16-bit character-encoding schemes have additional multi-byte representations for special characters such as "<". Some browsers recognize this alternative encoding and act on it. This is "correct" behavior, but it makes attacks using malicious scripts much harder to prevent. The server simply doesn't know which byte sequences represent the special characters.

For example, UTF-7 provides alternative encoding for "<" and ">", and several popular browsers recognize these as the start and end of a tag. This is not a bug in those browsers. If the character encoding really is UTF-7, then this is correct behavior. The problem is that it is possible to get into a situation in which the browser and the server disagree on the encoding. Web servers should set the character set, then make sure that the data they insert is free from byte sequences that are special in the specified encoding. For example:


<HTML>
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
<TITLE>HTML SAMPLE</TITLE>
</HEAD>
<BODY>
<P>This is a sample HTML page
</BODY>
</HTML>

The META tag in the HEAD section of this sample HTML forces the page to use the ISO-8859-1 character set encoding.
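
The same declaration can also be made in the HTTP response header, which is where the "charset" parameter mentioned above is carried. The sketch below assumes a response object with setHeader and end methods, as provided by the Node.js http.ServerResponse class; the sendPage helper and the stand-in response object are hypothetical.

```javascript
// Hypothetical helper: declares the encoding in the header and writes the
// body in that same encoding ("latin1" is Node's name for ISO-8859-1).
function sendPage(res, body) {
    res.setHeader("Content-Type", "text/html; charset=ISO-8859-1");
    res.end(body, "latin1");
}

// Stand-in response object so the sketch runs without starting a server.
var fakeRes = {
    headers: {},
    body: "",
    encoding: "",
    setHeader: function (name, value) { this.headers[name] = value; },
    end: function (body, encoding) { this.body = body; this.encoding = encoding; }
};

sendPage(fakeRes, "<P>This is a sample HTML page");
console.log(fakeRes.headers["Content-Type"]);
```

Whichever mechanism is used, the bytes sent must actually be in the declared encoding; otherwise the browser and the server again disagree.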

Identifying the Special Characters

The next two steps, encoding and filtering, first require an understanding of "special characters". The HTML specification determines which characters are "special", because they have an effect on how the page is displayed. However, many web browsers try to correct common errors in HTML. As a result, they sometimes treat characters as special when, according to the specification, they aren't. In addition, the set of special characters depends on the context:

In the content of a block-level element (in the middle of a paragraph of text)
- "<" is special because it introduces a tag.
- "&" is special because it introduces a character entity.
- ">" is special because some browsers treat it as special, on the assumption that the author of the page really meant to put in an opening "<", but omitted it in error.
Attribute values
- In attribute values enclosed with double quotes, the double quotes are special because they mark the end of the attribute value.
- In attribute values enclosed with single quotes, the single quotes are special because they mark the end of the attribute value.
- Attribute values without any quotes make the white-space characters such as space and tab special.
- "&" is special when used in conjunction with some attributes because it introduces a character entity.
In URLs. For example, a search engine might provide a link within the results page that the user can click to re-run the search. This can be implemented by encoding the search query inside the URL. When this is done, it introduces additional special characters:
- Space, tab, and new line are special because they mark the end of the URL.
- "&" is special because it introduces a character entity or separates CGI parameters.
- Non-ASCII characters (that is, every value of 128 or above in the ISO-8859-1 encoding) aren't allowed in URLs, so they are all special here.
- The "%" must be filtered from input anywhere parameters encoded with HTTP escape sequences are decoded by server-side code. The percent must be filtered if input such as "%68%65%6C%6C%6F" becomes "hello" when it appears on the web page in question.
Within the body of a <SCRIPT> </SCRIPT>
- The semicolon, parentheses, curly braces, and new line should be filtered in situations where text could be inserted directly into a preexisting script tag.
Server-side scripts
- Server-side scripts that convert any exclamation characters (!) in input to double-quote characters (") on output might require additional filtering.
Other possibilities
- No current exploits rely on the ampersand. This character may be useful in future exploits. Conservative web page authors should filter this character out if possible.

It is important to note that individual situations may warrant including additional characters in the list of special characters. Web developers must examine their applications and determine which characters can affect their web applications.
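
As a rough illustration of how the context changes what must be handled, the following JavaScript sketch covers three of the contexts listed above: element content, a double-quoted attribute value, and a URL parameter. The helper names and the /search?q= URL are hypothetical; encodeURIComponent is the standard JavaScript function for percent-encoding a URL component.

```javascript
// Element content: "<", ">" and "&" are special.
function escapeText(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Double-quoted attribute value: the double quote is special as well.
function escapeAttr(s) {
    return escapeText(s).replace(/"/g, "&quot;");
}

// Link that re-runs a search: percent-encode the query for the URL first,
// then escape the finished URL for the attribute it is placed in.
function searchLink(query) {
    var url = "/search?q=" + encodeURIComponent(query);
    return '<A HREF="' + escapeAttr(url) + '">' + escapeText(query) + "</A>";
}

var link = searchLink('cats & "dogs" <b>');
console.log(link);
// <A HREF="/search?q=cats%20%26%20%22dogs%22%20%3Cb%3E">cats &amp; "dogs" &lt;b&gt;</A>
```

The sketch does not cover unquoted attribute values or text placed inside a SCRIPT element; those contexts need their own treatment.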

Encoding Dynamic Output Elements

Each character in the ISO-8859-1 specification can be encoded as a numeric character reference built from its numeric value. A complete description of the ISO-8859-1 specification can be found in the appendix of this document.

The following example uses the copyright mark in an HTML document:

<p>&#169; 2000 Some Co., Inc.

The copyright character has the numeric value 169, and the &#nnn; syntax allows the author to insert encoded characters that will be interpreted by the browser.

In addition, many of the ISO-8859-1 characters include an entity name encoding. The copyright can also be done using this method:

<p>&copy; 2000 Some Co., Inc.

Encoding untrusted data has benefits over filtering untrusted data, including the preservation of visual appearance in the browser. This is important when special characters are considered acceptable.

Unfortunately, encoding all untrusted data can be resource intensive. Web developers must select a balance between encoding and the other option of data filtering.
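
A minimal JavaScript sketch of this kind of encoding follows. It assumes the input contains only ISO-8859-1 characters, and the function name encodeNumeric is hypothetical. Every character other than an ASCII letter, digit, or space is replaced with its numeric character reference, so the browser displays it rather than interpreting it.

```javascript
// Encode every character other than ASCII letters, digits and space as a
// numeric character reference (&#nnn;). Assumes ISO-8859-1 input.
function encodeNumeric(s) {
    return s.replace(/[^A-Za-z0-9 ]/g, function (ch) {
        return "&#" + ch.charCodeAt(0) + ";";
    });
}

console.log(encodeNumeric("<script>"));      // &#60;script&#62;
console.log(encodeNumeric("\u00A9 2000"));   // &#169; 2000
```

Because nothing is deleted, the text the user sees is unchanged; the cost is the extra work of encoding every value.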

Filtering Dynamic Content

Unfortunately, it is unclear whether there are any other characters or character combinations that can be used to expose other vulnerabilities. The recommended method is to select the set of characters that is known to be safe rather than excluding the set of characters that might be bad. For example, a form element that is expecting a person's age can be limited to the set of digits 0 through 9. There is no reason for this age element to accept any letters or other special characters. Using this positive approach of selecting the characters that are acceptable will help to reduce the ability to exploit other yet unknown vulnerabilities.
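
The age example might be sketched in JavaScript as follows. The function name isValidAge and the limit of three digits are assumptions for illustration; the point is that the check names what is allowed rather than what is forbidden.

```javascript
// Positive approach: accept one to three digits and nothing else.
function isValidAge(input) {
    return /^[0-9]{1,3}$/.test(input);
}

console.log(isValidAge("42"));            // true
console.log(isValidAge("42<script>"));    // false
console.log(isValidAge(""));              // false
```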

The filtering process can be done as part of the data input process, the data output process, or both. Filtering the data during the output process, just before it is rendered as part of the dynamic page, is recommended. Done correctly, this approach ensures that all dynamic content is filtered. Filtering on the input side is less effective because dynamic content can be entered into a web site's database(s) via methods other than HTTP. In this case, the web server may never see the data as part of the input process. Unless the filtering is implemented in all places where dynamic data is entered, the data elements may still remain tainted.
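
One way to make output-side filtering hard to forget is to route every inserted value through the filter at the point where the page is assembled. The JavaScript sketch below does this with a template tag; the names filterValue and filtered, and the particular set of allowed characters, are assumptions for illustration only.

```javascript
// Positive filter: keep letters, digits, space and a little punctuation.
function filterValue(value) {
    return String(value).replace(/[^A-Za-z0-9 .,!?-]/g, "");
}

// Template tag: literal parts pass through untouched, and every inserted
// value is filtered just before it becomes part of the page.
function filtered(strings, ...values) {
    var out = strings[0];
    for (var i = 0; i < values.length; i++) {
        out += filterValue(values[i]) + strings[i + 1];
    }
    return out;
}

// The value could have come from a form, a cookie, or a database row.
var comment = "<SCRIPT>alert(1)</SCRIPT>";
var page = filtered`<P>Latest comment: ${comment}</P>`;
console.log(page);   // <P>Latest comment: SCRIPTalert1SCRIPT</P>
```

Because the filter runs when the page is built, it applies no matter how the value reached the site.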

Examine Cookies

One method to exploit this vulnerability involves inserting malicious content into a cookie. Web developers should carefully examine cookies that they accept and use the filtering techniques described above to verify that they are not storing malicious content.
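
As a sketch of what examining cookies could look like, the following JavaScript parses a Cookie request header and keeps only the pairs whose name and value are drawn from a known-safe character set. The function name safeCookies, the allowed character set, and the sample header are assumptions for illustration.

```javascript
// Keep only cookies whose name and value consist of letters, digits,
// underscore or hyphen; everything else is dropped.
function safeCookies(header) {
    var safe = /^[A-Za-z0-9_-]+$/;
    var result = Object.create(null);   // no inherited properties
    header.split(";").forEach(function (pair) {
        var eq = pair.indexOf("=");
        if (eq === -1) return;          // not a name=value pair
        var name = pair.slice(0, eq).trim();
        var value = pair.slice(eq + 1).trim();
        if (safe.test(name) && safe.test(value)) {
            result[name] = value;
        }
    });
    return result;
}

var cookies = safeCookies("theme=dark; nick=<SCRIPT>alert(1)</SCRIPT>");
console.log(cookies.theme);        // dark
console.log("nick" in cookies);    // false
```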

Sample Filtering Code

C++ Example


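#include <windows.h>  // BYTE and DWORD (assumed to be the Windows typedefs)

// Lookup table indexed by byte value; 0xFF marks a byte to be removed.
// Flagged in this table: " # & ' ( ) ; < >
BYTE IsBadChar[] = {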
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0xFF,0xFF,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0xFF,0xFF,0x00,0xFF,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00
};

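// Removes, in place, every byte flagged in IsBadChar from the first cChLen
// bytes of pString, stopping early at a NUL byte.
// Returns the number of bytes kept.
DWORD FilterBuffer(BYTE * pString, DWORD cChLen){
	BYTE * pBad  = pString;
	BYTE * pGood = pString;
	DWORD i=0;
	if (!pString) return 0;
	// Honor the caller's length as well as the terminator.
	for (i=0; i<cChLen && pBad[i]; i++){
		if (!IsBadChar[pBad[i]]) *pGood++ = pBad[i];
	}
	// Re-terminate if bytes were removed, so stale trailing bytes are not read.
	if (pGood < pBad + i) *pGood = 0;
	return (DWORD)(pGood - pString);
}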

JavaScript Example


// Remove every occurrence of the characters < > " ' % ; ( ) & +
function RemoveBad(InStr){
    return InStr.replace(/[<>"'%;()&+]/g, "");
}

Perl Example


# The first function takes the negative approach.
# Use a list of bad characters to filter the data.
sub FilterNeg {
    my ( $fd ) = @_;
    $fd =~ s/[\<\>\"\'\%\;\)\(\&\+]//g;
    return( $fd ) ;
}

# The second function takes the positive approach.
# Use a list of good characters to filter the data.
sub FilterPos {
    my ( $fd ) = @_;
    $fd =~ tr/A-Za-z0-9\ //dc;
    return( $fd ) ;
}

my $Data = "This is a test string<script>";
$Data = FilterNeg( $Data );
print "$Data\n";

$Data = "This is a test string<script>";
$Data = FilterPos( $Data );
print "$Data\n";
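
For comparison, the two Perl filters can be mirrored in JavaScript as in the sketch below. The function names filterNeg and filterPos are hypothetical, and the second input string is an added illustration of where the two approaches diverge.

```javascript
// Negative approach: delete characters from a list of known-bad characters.
function filterNeg(s) {
    return s.replace(/[<>"'%;()&+]/g, "");
}

// Positive approach: delete everything that is not a letter, digit, or space.
function filterPos(s) {
    return s.replace(/[^A-Za-z0-9 ]/g, "");
}

var data = "This is a test string<script>";
console.log(filterNeg(data));   // This is a test stringscript
console.log(filterPos(data));   // This is a test stringscript

// The approaches differ on characters the bad list did not anticipate.
console.log(filterNeg("a=b|c"));   // a=b|c
console.log(filterPos("a=b|c"));   // abc
```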

ISO-8859-1 (Latin-1) Character Set

Number	Name	Description	Appearance
&#0;-&#8;	-	Unused	-
&#9;	-	Horizontal tab	space
&#10;	-	Line feed	space
&#11;-&#31;	-	Unused	-
&#32;	-	Space	space
&#33;	-	Exclamation mark	!
&#34;	&quot;	Quotation mark	"
&#35;	-	Number sign	#
&#36;	-	Dollar sign	$
&#37;	-	Percent sign	%
&#38;	&amp;	Ampersand	&
&#39;	-	Apostrophe	'
&#40;	-	Left parenthesis	(
&#41;	-	Right parenthesis	)
&#42;	-	Asterisk	*
&#43;	-	Plus sign	+
&#44;	-	Comma	,
&#45;	-	Hyphen	-
&#46;	-	Period (full stop)	.
&#47;	-	Solidus (slash)	/
&#48;-&#57;	-	Digits (0-9)	0-9
&#58;	-	Colon	:
&#59;	-	Semicolon	;
&#60;	&lt;	Less than	<
&#61;	-	Equals sign	=
&#62;	&gt;	Greater than	>
&#63;	-	Question mark	?
&#64;	-	Commercial at	@
&#65;-&#90;	-	Uppercase A-Z	A-Z
&#91;	-	Left square bracket	[
&#92;	-	Reverse solidus (backslash)	\
&#93;	-	Right square bracket	]
&#94;	-	Caret	^
&#95;	-	Horizontal bar (underscore)	_
&#96;	-	Grave accent	`
&#97;-&#122;	-	Lowercase a-z	a-z
&#123;	-	Left curly brace	{
&#124;	-	Vertical bar	|
&#125;	-	Right curly brace	}
&#126;	-	Tilde	~
&#127;-&#159;	-	Unused	-
&#160;	&nbsp;	Non-breaking space	space
&#161;	&iexcl;	Inverted exclamation	¡
&#162;	&cent;	Cent sign	¢
&#163;	&pound;	Pound sterling sign	£
&#164;	&curren;	General currency sign	¤
&#165;	&yen;	Yen sign	¥
&#166;	&brvbar;	Broken vertical bar	¦
&#167;	&sect;	Section sign	§
&#168;	&uml;	Umlaut (dieresis)	¨
&#169;	&copy;	Copyright	©
&#170;	&ordf;	Feminine ordinal	ª
&#171;	&laquo;	Left angle quote, guillemot left	«
&#172;	&not;	Not sign	¬
&#173;	&shy;	Soft hyphen	-
&#174;	&reg;	Registered trademark	®
&#175;	&macr;	Macron accent	¯
&#176;	&deg;	Degree sign	°
&#177;	&plusmn;	Plus or minus	±
&#178;	&sup2;	Superscript two	²
&#179;	&sup3;	Superscript three	³
&#180;	&acute;	Acute accent	´
&#181;	&micro;	Micro sign	µ
&#182;	&para;	Paragraph sign	¶
&#183;	&middot;	Middle dot	·
&#184;	&cedil;	Cedilla	¸
&#185;	&sup1;	Superscript one	¹
&#186;	&ordm;	Masculine ordinal	º
&#187;	&raquo;	Right angle quote, guillemot right	»
&#188;	&frac14;	Fraction (one quarter)	¼
&#189;	&frac12;	Fraction (one half)	½
&#190;	&frac34;	Fraction (three quarters)	¾
&#191;	&iquest;	Inverted question mark	¿
&#192;	&Agrave;	Capital A, grave accent	À
&#193;	&Aacute;	Capital A, acute accent	Á
&#194;	&Acirc;	Capital A, circumflex accent	Â
&#195;	&Atilde;	Capital A, tilde	Ã
&#196;	&Auml;	Capital A, umlaut (dieresis)	Ä
&#197;	&Aring;	Capital A, ring	Å
&#198;	&AElig;	Capital AE diphthong (ligature)	Æ
&#199;	&Ccedil;	Capital C, cedilla	Ç
&#200;	&Egrave;	Capital E, grave accent	È
&#201;	&Eacute;	Capital E, acute accent	É
&#202;	&Ecirc;	Capital E, circumflex accent	Ê
&#203;	&Euml;	Capital E, umlaut (dieresis)	Ë
&#204;	&Igrave;	Capital I, grave accent	Ì
&#205;	&Iacute;	Capital I, acute accent	Í
&#206;	&Icirc;	Capital I, circumflex accent	Î
&#207;	&Iuml;	Capital I, umlaut (dieresis)	Ï
&#208;	&ETH;	Capital Eth, Icelandic	Ð
&#209;	&Ntilde;	Capital N, tilde	Ñ
&#210;	&Ograve;	Capital O, grave accent	Ò
&#211;	&Oacute;	Capital O, acute accent	Ó
&#212;	&Ocirc;	Capital O, circumflex accent	Ô
&#213;	&Otilde;	Capital O, tilde	Õ
&#214;	&Ouml;	Capital O, umlaut (dieresis)	Ö
&#215;	&times;	Multiply sign	×
&#216;	&Oslash;	Capital O, slash	Ø
&#217;	&Ugrave;	Capital U, grave accent	Ù
&#218;	&Uacute;	Capital U, acute accent	Ú
&#219;	&Ucirc;	Capital U, circumflex accent	Û
&#220;	&Uuml;	Capital U, umlaut (dieresis)	Ü
&#221;	&Yacute;	Capital Y, acute accent	Ý
&#222;	&THORN;	Capital Thorn, Icelandic	Þ
&#223;	&szlig;	Small sharp s, German (sz ligature)	ß
&#224;	&agrave;	Small a, grave accent	à
&#225;	&aacute;	Small a, acute accent	á
&#226;	&acirc;	Small a, circumflex accent	â
&#227;	&atilde;	Small a, tilde	ã
&#228;	&auml;	Small a, umlaut (dieresis)	ä
&#229;	&aring;	Small a, ring	å
&#230;	&aelig;	Small ae diphthong (ligature)	æ
&#231;	&ccedil;	Small c, cedilla	ç
&#232;	&egrave;	Small e, grave accent	è
&#233;	&eacute;	Small e, acute accent	é
&#234;	&ecirc;	Small e, circumflex accent	ê
&#235;	&euml;	Small e, umlaut (dieresis)	ë
&#236;	&igrave;	Small i, grave accent	ì
&#237;	&iacute;	Small i, acute accent	í
&#238;	&icirc;	Small i, circumflex accent	î
&#239;	&iuml;	Small i, umlaut (dieresis)	ï
&#240;	&eth;	Small eth, Icelandic	ð
&#241;	&ntilde;	Small n, tilde	ñ
&#242;	&ograve;	Small o, grave accent	ò
&#243;	&oacute;	Small o, acute accent	ó
&#244;	&ocirc;	Small o, circumflex accent	ô
&#245;	&otilde;	Small o, tilde	õ
&#246;	&ouml;	Small o, umlaut (dieresis)	ö
&#247;	&divide;	Division sign	÷
&#248;	&oslash;	Small o, slash	ø
&#249;	&ugrave;	Small u, grave accent	ù
&#250;	&uacute;	Small u, acute accent	ú
&#251;	&ucirc;	Small u, circumflex accent	û
&#252;	&uuml;	Small u, umlaut (dieresis)	ü
&#253;	&yacute;	Small y, acute accent	ý
&#254;	&thorn;	Small thorn, Icelandic	þ
&#255;	&yuml;	Small y, umlaut (dieresis)	ÿ

This document is available from: http://www.cert.org/tech_tips/malicious_code_mitigation.html

CERT/CC Contact Information

Email: cert@cert.org
Phone: +1 412-268-7090 (24-hour hotline)
Fax: +1 412-268-6989
Postal address:

CERT Coordination Center
Software Engineering Institute
Carnegie Mellon University
Pittsburgh PA 15213-3890
U.S.A.

CERT personnel answer the hotline 08:00-20:00 EST(GMT-5) / EDT(GMT-4) Monday through Friday; they are on call for emergencies during other hours, on U.S. holidays, and on weekends.

Using encryption

We strongly urge you to encrypt sensitive information sent by email. Our public PGP key is available from

http://www.cert.org/CERT_PGP.key

If you prefer to use DES, please call the CERT hotline for more information.

Getting security information

CERT publications and other security information are available from our web site

http://www.cert.org/

To be added to our mailing list for advisories and bulletins, send email to cert-advisory-request@cert.org and include SUBSCRIBE your-email-address in the subject of your message.

* "CERT" and "CERT Coordination Center" are registered in the U.S. Patent and Trademark Office.

NO WARRANTY
Any material furnished by Carnegie Mellon University and the Software Engineering Institute is furnished on an "as is" basis. Carnegie Mellon University makes no warranties of any kind, either expressed or implied as to any matter including, but not limited to, warranty of fitness for a particular purpose or merchantability, exclusivity or results obtained from use of the material. Carnegie Mellon University does not make any warranty of any kind with respect to freedom from patent, trademark, or copyright infringement.


Revision History
Feb 2, 2000	Initial Release

