How To Remove Meta-characters From User-Supplied Data In CGI Scripts

Consider an example where a CGI script accepts user-supplied data. In practice, this data may come from any number of sources; for this example, we will say that the data is taken from the environment variable $QUERY_STRING. The manner in which the data was inserted into the variable is not important - the important point is that the programmer needs to gain control over the contents of $QUERY_STRING before further processing can occur. The act of gaining this control is called "sanitizing" the data.

A script writer who is aware of the need to sanitize data may decide to replace a number of well-known meta-characters in the data with underscores. This is a common but inadvisable approach. For instance, in Perl:

    #!/usr/local/bin/perl
    $user_data = $ENV{'QUERY_STRING'};    # Get the data
    print "$user_data\n";
    $user_data =~ s/[\/ ;\[\]\<\>&\t]/_/g;    # Remove bad characters. WRONG!
    print "$user_data\n";
    exit(0);

In this method, the programmer determines which characters should NOT be present in the user-supplied data and removes them. The problem with this approach is that it requires the programmer to predict every input that could be misused. If the user supplies input not predicted by the programmer, the script may be used in a manner the programmer did not intend.

A better approach is to define a list of acceptable characters and replace any character that is NOT acceptable with an underscore. The list of valid input values is typically a predictable, well-defined set of manageable size. The benefit of this approach is that the programmer is certain that whatever string is returned, it contains only characters now under his or her control.

This approach contrasts with the approach we discussed earlier.
In the earlier approach, which we do not recommend, the programmer must trap every unacceptable character, leaving no margin for error. In the recommended approach, the programmer errs on the side of caution and only needs to identify the acceptable characters; thus the programmer can be less concerned about which characters an attacker may try in an attempt to bypass security checks.

Building on this philosophy, the Perl program presented above could be rewritten so that the data contains ONLY the characters allowed. For example:

    #!/usr/local/bin/perl
    $_ = $user_data = $ENV{'QUERY_STRING'};    # Get the data
    print "$user_data\n";
    $OK_CHARS='-a-zA-Z0-9_.@';    # A restrictive list, which
                                  # should be modified to match
                                  # an appropriate RFC, for example.
    s/[^$OK_CHARS]/_/go;
    $user_data = $_;
    print "$user_data\n";
    exit(0);

Sanitizing data is also recommended for other Perl operations. For instance, many Perl scripts accept arbitrary filenames from users. While the script should obviously check the filename to ensure that it represents a file the user should have access to, the first step in any filename processing should be sanitization (as discussed above). The reason is that meta-characters (such as ">" and "|") have special meaning in file-oriented functions in Perl.

Another example is Perl scripts that call the eval function with user-supplied arguments. A call to eval essentially represents the execution of a mini-program within the Perl script being executed. Programmers are encouraged to maintain control over the content of the user-supplied data, with the intent of preventing the user from executing uncontrolled instructions within that environment.
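
The filename case described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete solution: the helper names sanitize_filename and open_for_read are hypothetical, and the allow-list simply reuses the characters from $OK_CHARS. The sketch also uses the three-argument form of Perl's open, which passes the mode separately from the name; that is an additional precaution not mentioned in the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character that is NOT in the
# allow-list with an underscore. The list mirrors $OK_CHARS above and
# should be adjusted to whatever the application really permits.
sub sanitize_filename {
    my ($name) = @_;
    $name = '' unless defined $name;     # treat missing input as empty
    $name =~ s/[^-a-zA-Z0-9_.@]/_/g;
    return $name;
}

# Hypothetical helper: open a sanitized name for reading only.
# With three-argument open the mode ('<') is separate from the name,
# so a leading ">" or trailing "|" is never interpreted as a mode.
sub open_for_read {
    my ($name) = @_;
    my $clean = sanitize_filename($name);
    open(my $fh, '<', $clean) or return;  # caller must check for failure
    return $fh;
}

# Meta-characters in the input are neutralized before any file function
# sees them.
print sanitize_filename('>/etc/passwd'), "\n";   # prints "__etc_passwd"
print sanitize_filename('ls|'), "\n";            # prints "ls_"
```

Sanitizing the name does not replace the access check: the script still has to confirm that the resulting file is one the user is entitled to read.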