FOAA

About development & stuff

PHP Validation & Sanitization


Validation and sanitization are extremely important topics that any developer should be aware of. Especially with powerful, modern frameworks, people seem to forget about the underlying concepts and wrongly assume the problem is already solved somehow. Used correctly and integrated early on, both play a central role in defending your application against attacks*.

This article illustrates the underlying need and explains why you should care. A general discussion of techniques and approaches follows, leading to concrete implementations in three different frameworks. Finally, a weekend project of mine is briefly introduced.

*: Attacks can of course bypass your application layer completely and target, for example, the OS or the network layer directly.

Why care?

You write a new app - either for yourself, your client/employer or maybe as part of an open source project. Whatever the case, you put a lot of hours and effort into it. As you might know, there are other people who also put a lot of effort into your website: spammers, scammers, malware distributors, blackhat SEOs and kids with too much time on their hands. They can put your painstakingly thought-out and built website to their own use - if you don’t take precautions.

You don’t want that. It is annoying if you have no money at stake, dangerous if you do, and it can harm real humans if you work with sensitive data. Once your app is breached, you might end up as a spam centrifuge or your customers will leave you - really, I cannot think of a single good outcome.

Attacks on your website can take place on multiple levels. For simplicity, let’s limit them to:

  • Your app
    • Your code
    • The framework or third party modules
  • Services involved in delivering your app (eg your web server, database server)
  • Other services provided by machines your app runs on (eg FTP, SSH, ..)
  • The server operating systems or the virtualization layer
  • The network
  • And let’s not forget: yourself (or other people having privileged access to your app).

Threats to your App

Depending on your hosting situation, you might not have access to all levels. This article is about the topmost layer - the one you definitely have access to.

Ok, let’s have a quick look at the kinds of attacks you face here, ordered by (but not including all of) the OWASP (Open Web Application Security Project) list from 2010:

Injection

Injections encompass a great many different attacks. All of them aim to run code on your (server) side. Just two examples:

  • SQL injection: Trying to manipulate your database queries
    Example: http://foo.tld/?userid=' or '1'='1
    Of course, the same idea works against NoSQL, LDAP or whatnot data stores as well
  • Code injection: Trying to get your app to execute attacker-supplied code
    Remember register_globals? It surely was an easy-to-use vector for those attacks..
    Other example: $res = eval('return 1 + '. $_GET['crazy']. ';'); with ?crazy=0;exec("rm+-rf+/")
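
For the userid example above, one possible first line of defense is to accept the parameter only if it really is what you expect: a positive integer. The following is a minimal sketch, not a complete defense (parameterized queries remain the usual second layer); the helper name parseUserId is made up for this illustration.

```php
<?php
// Hypothetical helper: returns the user id as an integer, or null if the
// raw value is anything other than a positive integer.
function parseUserId($raw): ?int
{
    $id = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);

    return $id === false ? null : $id;
}

var_dump(parseUserId('42'));           // int(42)
var_dump(parseUserId("' or '1'='1"));  // NULL
```

Anything that comes back as null never gets near the query, which is the point: the attack string is rejected instead of being "repaired".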

XSS

XSS stands for cross-site scripting. The general idea is to plant code (JavaScript, HTML, Flash) so that it executes when other users (or you) visit a page of the attacked app. A quite common form is probably script planting (basically, this is also a kind of injection, but the injected code is not run by your app/database/.. but by a user/visitor). Assume the following is part of a signup form (I did not URL-encode the attack code, so you can read it better):

http://foo.tld/signup?username=<script>document.location="http://scam.tld";</script>&...
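
To see why the planted username above is harmless once it is escaped, here is a small sketch using PHP's htmlspecialchars. The wrapper name escapeHtml is an assumption of this example; escaping on output is only one option and complements the input sanitization discussed further down.

```php
<?php
// Hypothetical wrapper: escape a value for use inside HTML text or attributes.
function escapeHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// The username from the signup request above
$username = '<script>document.location="http://scam.tld";</script>';

// Prints the script as visible text instead of executing it:
// <p>Welcome, &lt;script&gt;document.location=&quot;http://scam.tld&quot;;&lt;/script&gt;</p>
echo '<p>Welcome, ' . escapeHtml($username) . '</p>', PHP_EOL;
```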

CSRF

CSRF stands for cross-site request forgery. It is an attack which tries to ride / reuse the session of a (valid) user to do things in “his name”. To do so, the attacker lures the target to his own website (or one he has compromised, eg using XSS) and triggers a request in the background. The most commonly given example abuses an img tag. Imagine you have an account with the mail provider ACME and are lured to a page containing the following:

<img src="http://acme.tld/send-mail?to=boss@yourwork.tld&subject=I+quit+you+twit"/>

If you visit the malicious site while being logged in to ACME, ACME does not protect against CSRF and your boss does not have a sense of humor: search for a new job. Real CSRF attacks are usually aimed at a broader audience and focused on monetary gain. However, I hope you get the drift..
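
A common countermeasure is a secret token that is stored in the user's session and must be sent along with every state-changing request; the attacker's page cannot know it. The following is a rough sketch under assumptions: the function names are made up, and a plain array stands in for $_SESSION so the example also runs on the command line.

```php
<?php
// Hypothetical helper: create a random token and remember it in the session.
function issueCsrfToken(array &$session): string
{
    $session['csrf_token'] = bin2hex(random_bytes(32));

    return $session['csrf_token'];
}

// Hypothetical helper: compare the submitted token with the stored one
// (hash_equals avoids leaking information through timing differences).
function isValidCsrfToken(array $session, $submitted): bool
{
    return isset($session['csrf_token'])
        && is_string($submitted)
        && hash_equals($session['csrf_token'], $submitted);
}

$session = [];                        // stands in for $_SESSION
$token   = issueCsrfToken($session);  // would be embedded in the form as a hidden field

var_dump(isValidCsrfToken($session, $token));    // bool(true)
var_dump(isValidCsrfToken($session, 'guessed')); // bool(false)
```

In the ACME example, the forged img request would arrive without the token and could be rejected. Sending mail through a GET request is a problem of its own, of course.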

Protect your App

So how can you avoid those attacks? Well, there is no magic PHP module you can install to make all of it go away. But there is a way to at least minimize the threat: become paranoid. See everything as an attack. Treat all user input as if it were an attack for sure! Consider every request maliciously intended. Also: you usually try to make your app stable, and once everything works you are satisfied - forget about this: try to break it. If you fail, try harder! Or, if you have the opportunity, ask someone else to do it for you (all of this preferably in a testing environment, of course!). Bruce Schneier wrote a great article on this matter. I urge you to read it.

In the context of validation, sanitization and especially injections I should at least mention Web application firewalls (WAF) and IPS/IDS. However, it would exceed the scope of this article to go into detail here.

This article tries to provide some background on two major topics which, if correctly understood and applied, are the core defenses against XSS and injection attacks.

Validation vs Sanitization

Both of those techniques are useful on their own. They can be considered completely independent, but in (PHP) web development they are usually dealt with together: when handling forms.

Validation

Data validation revolves around checking whether given input is well-formed with respect to given requirements (read: rules).

For example, the string 2011-11-20 might trigger exceptions in data validation if you expect a date in the format DD-MM-YYYY and not YYYY-MM-DD. Another example: sanitization might (simplified) remove any characters other than a-z, “@” and “.”, whereas validation checks whether the result also looks like an email address and/or whether the domain of this email really exists and so on.

Data validation is often (if not always) used on a per-parameter basis (eg your email input will be checked against other rules than your amount-value input).

Data validation does not (imo: should never) alter data.

A “complete” data validation would make sanitization obsolete. However, it would either lead to bad usability (eg “silently” converting a username to lowercase, URL-compatible characters is less annoying for the user than throwing errors until he gets it right) or end up in a huge amount of work (eg if you are sure you won’t allow the user to write HTML in any field, it’s easier to strip it from all input than to add a validation to each input separately that throws an error if HTML is included).
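
The date example from above could be checked as follows. This is just a sketch: the function name isDate is made up, and the round trip through format() is one possible way to also reject dates such as 31-02-2011, which PHP would otherwise silently roll over into March.

```php
<?php
// Hypothetical validator: true only if $input is a real date in the given format.
// Note that it only checks; the input itself is never altered.
function isDate(string $input, string $format = 'd-m-Y'): bool
{
    // "!" resets all fields that are not part of the format (eg the time)
    $date = DateTime::createFromFormat('!' . $format, $input);

    // Formatting the parsed date again must reproduce the input exactly
    return $date !== false && $date->format($format) === $input;
}

var_dump(isDate('20-11-2011')); // bool(true)
var_dump(isDate('2011-11-20')); // bool(false), wrong format
var_dump(isDate('31-02-2011')); // bool(false), no such day
```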

Sanitization

If you go to the Wikipedia data sanitization article it will immediately redirect you to the code injection article. This is because data sanitization is the implementation of code injection prevention.

Sanitization, unlike validation, can alter the data it works on. Good examples would be the stripping of any HTML code or the removal of new-line characters from email From or To fields.

Sanitization is often used globally (eg strip all HTML from any input) as well as locally (eg strip all non numeric from this particular supposed-to-be integer input).
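
The examples mentioned so far can be sketched with a few lines of plain PHP. The function names are made up for this illustration; the first one is a typical "global" sanitization, the other two are "local" ones for particular fields.

```php
<?php
// Global: strip all HTML tags from an input
function stripHtml(string $input): string
{
    return strip_tags($input);
}

// Local: remove new-line characters, eg from a value used in a mail From/To header
function stripNewlines(string $input): string
{
    return str_replace(["\r", "\n"], '', $input);
}

// Local: keep only the digits of a supposed-to-be integer input
function digitsOnly(string $input): string
{
    return (string) preg_replace('/\D+/', '', $input);
}

var_dump(stripHtml('Hello <b>world</b>'));                   // string(11) "Hello world"
var_dump(stripNewlines("me@foo.tld\r\nBcc: all@foo.tld"));   // string(25) "me@foo.tldBcc: all@foo.tld"
var_dump(digitsOnly('12a3'));                                // string(3) "123"
```

Note that the sanitized header value is harmless but still not a valid email address: sanitization does not replace validation.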

Also referred to as sanitization are filters that manipulate the data for no immediate security reason, for example the trimming of input strings or the like. In my opinion, those should just be called filters. Sanitizations are a subgroup / special kind of filters, for that matter.

Sanitization is not only used for input data - it can also be applied before outputting data. For example, if you do not check incoming text data for (HTML, JS, ..) injections, you can sanitize it whenever you output the text. In my opinion, this is a good idea only in very rare cases. Given the example before, it simply wouldn’t make sense to store data containing injection code - aside from security reviews, that is.

How to employ

Some thoughts I tend to follow:

  1. Do I allow any kind of HTML in user input? If not (mostly): make sure it is sanitized out as early as possible for any input.
  2. Which languages are to be supported? Only English? Then drop any other characters (eg Korean)!
  3. Begin with the strictest validation possible. Loosen the rules as needed, when needed - not before!
  4. Never, ever trust the user (input).
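
As an illustration of the second point, this is roughly how an English-only app could drop everything outside the printable ASCII range. It is a sketch only, and the function name keepPrintableAscii is made up; be aware that it also removes tabs, line breaks and accented letters.

```php
<?php
// Hypothetical filter: keep only printable ASCII characters (space to tilde)
function keepPrintableAscii(string $input): string
{
    return (string) preg_replace('/[^\x20-\x7E]/', '', $input);
}

// "\u{C548}\u{B155}" are two Korean (Hangul) characters
var_dump(keepPrintableAscii("Hello \u{C548}\u{B155}!")); // string(7) "Hello !"
```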

General implementation for MVCs

Even within a single framework, there are often multiple ways to validate / sanitize data. However, there are two general approaches, defined by where validation and sanitization take place: within the controller logic or within the model (ORM) logic. Often, both are available to varying degrees and can be combined or work together natively.

Controller centric

Controller-centric validation and sanitization follows a form-based approach: given a known set of forms in which users input data, you define validation rules for each of them.

Pro

  • You can write a minimal and targeted validation profile for each form.
  • Your validation logic is agnostic of the model actually used (or models, if you use multiple data sources).
  • Model-less validation is possible, if you either do not use a model at all or not everywhere you need validation.
  • If the data is already sanitized at the controller layer, it is easier to work with (eg to output it).

Contra

  • Sanitization for data source type specific injections is not a good idea in the controller.
  • Repetition of rules for forms with intersecting input fields (eg login form and signup form, both validating username).
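
Stripped of any framework, the controller-centric idea can be sketched with PHP's filter_var_array: each form gets its own rule set, and the controller action checks the request against it. The names validateForm and $signupRules as well as the error strings are made up for this illustration.

```php
<?php
// Hypothetical per-form rule set, as it could live next to a signup action
$signupRules = [
    'login' => [
        'filter'  => FILTER_VALIDATE_REGEXP,
        'options' => ['regexp' => '/\A[a-z0-9]{3,20}\z/'],
    ],
    'email' => FILTER_VALIDATE_EMAIL,
];

// Hypothetical helper: returns a list of errors per field, empty if all is good
function validateForm(array $input, array $rules): array
{
    $errors = [];
    // filter_var_array yields null for missing fields and false for invalid ones
    foreach (filter_var_array($input, $rules) as $field => $value) {
        if ($value === null) {
            $errors[$field] = 'missing';
        } elseif ($value === false) {
            $errors[$field] = 'invalid';
        }
    }

    return $errors;
}

// In a real controller, $input would be $_POST or the framework's request data
print_r(validateForm(['login' => 'alice', 'email' => 'not an email'], $signupRules));
```

A login form would bring its own, smaller rule set, which is exactly where the repetition mentioned above comes from.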

Model centric

With purely model-centric validation and sanitization, you define validation rules once, in the model definition (of the ORM). They apply whenever the model is updated or a new instance is created. Sanitization at this point mainly concerns database code injections (eg SQL injections).

Pro

  • Model-centric validation does not care “where” (in your business logic, eg controller code) you need validation - so validating different forms (each partially representing a model) is easy.
  • Sanitization against injections in particular should be data source specific -> it makes sense to do it in the model (otherwise the controller layer would need to be aware of the data source type actually used).

Contra

  • Complex forms involving multiple models with inter-dependent validation requirements might not be possible.
  • A model layer is required (validation without a model is not possible).
  • Multiple models (read: data sources) might have no / different validation approaches -> annoying (this does not apply to sanitization).
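
For comparison, a framework-free sketch of the model-centric idea: the rules are defined once inside the model, and every write goes through them, no matter which form or controller triggers it. The class DummyModel and its methods are made up for this illustration and do not mirror any particular ORM.

```php
<?php
// Hypothetical model: validation rules live with the data they protect
final class DummyModel
{
    // field name => [filter, options] as understood by filter_var
    private const RULES = [
        'login' => [FILTER_VALIDATE_REGEXP, ['options' => ['regexp' => '/\A[a-z0-9]{3,20}\z/']]],
        'email' => [FILTER_VALIDATE_EMAIL, []],
    ];

    private $values = [];

    public function set(string $field, $value): void
    {
        if (!isset(self::RULES[$field])) {
            throw new InvalidArgumentException("Unknown field '$field'");
        }
        [$filter, $options] = self::RULES[$field];
        if (filter_var($value, $filter, $options) === false) {
            throw new InvalidArgumentException("Invalid value for '$field'");
        }
        $this->values[$field] = $value;
    }

    public function get(string $field)
    {
        return $this->values[$field] ?? null;
    }
}

$model = new DummyModel();
$model->set('login', 'alice');         // passes the rule
try {
    $model->set('email', 'not an email');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;    // Invalid value for 'email'
}
```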

The PHP filter extension

The PHP filter extension provides a simple approach to validate and sanitize single data inputs.

There is a pre-defined set of validation filters, providing a good default of often used validation requirements (eg: email, integer, IP, regular expressions or URL). There is also a useful set of sanitization filters, including the most common needs as well. Both can be configured with a set of filter flags.

A short example:

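<?php
// ..
// sanitize first: removes all characters not allowed in an email address
$email = filter_var($input, FILTER_SANITIZE_EMAIL);
// then validate the sanitized value (not the raw input)
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
    echo "Your email '$email' was accepted";
} else {
    echo "Sorry, but '$email' seems not to be an email";
}
//..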

This extension is quite useful if you build a minimal, single-purpose PHP script.
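
The filter flags mentioned above change how a filter behaves. A few examples using filters and flags that ship with the extension; the variable names are only there for illustration.

```php
<?php
// Flag narrows the IP filter down to IPv4 addresses
$ipv4 = filter_var('192.168.0.1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV4);
$ipv6 = filter_var('::1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV4);

// Flag makes the boolean filter return null (instead of false) for garbage,
// so "no" and "not a boolean at all" can be told apart
$yes   = filter_var('yes', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
$maybe = filter_var('maybe', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);

// Sanitization filter: removes everything except digits, plus and minus signs
$amount = filter_var('12 apples', FILTER_SANITIZE_NUMBER_INT);

var_dump($ipv4);   // string(11) "192.168.0.1"
var_dump($ipv6);   // bool(false)
var_dump($yes);    // bool(true)
var_dump($maybe);  // NULL
var_dump($amount); // string(2) "12"
```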

Overview of Validation & Sanitization in MVCs

The showcases for the presented frameworks should give you a quick impression of how each of them tackles validation and sanitization. I will not go into great detail (each of them would merit an article as long as this one), but provide some links in case you are interested.

Maybe you’ve already worked with one of them; then you might be interested in how the others deal with the problem.

For all three frameworks, I will provide all the code required to validate a simple form consisting of a field named login, which is required but has no further validation rules, and a field called email, which must be an email and also has an inverted regular expression rule (the value must not match the pattern). Both email rules should issue different errors. To showcase sanitization, I will strip HTML from all input. Also, I will use a filter converting an input into a URL-compatible string.

Symfony 2

I used Symfony 2.1.3, which uses Doctrine as its ORM. Doctrine comes with validation and sanitization of its own. I’ll only go into Symfony2’s own validation & sanitization methods.

Validation

Validation in Symfony2 on the controller level (see above) is based on POPOs (plain old PHP objects). As always, with Symfony2, you have multiple ways to do this: Using YAML, Annotations, XML or direct PHP.

Once the POPO is created and filled with the user input, you use the validator service in the controller to check it.

For this example, I’ll go with the YAML configuration. As the Symfony validation requires more than one file, I’ll paste all three:

src/Test/MyBundle/Resources/config/validation.yml (the validation rules)

Test\MyBundle\Entity\Dummy:
    properties:
        login:
            - NotBlank: ~
        email:
            - NotBlank: ~
            - Email:
                message: "This is not an email"
            - Regex:
                pattern: "/a.*b/"
                match: false
                message: "I don't like this email"

src/Test/MyBundle/Entity/Dummy.php (the POPO)

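<?php
// ..
// namespace must match the class name used in validation.yml and in the controller
namespace Test\MyBundle\Entity;

class Dummy {
    public $login;
    public $email;
}
//..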

src/Test/MyBundle/Controller/DefaultController.php

<?php
// ..
namespace Test\MyBundle\Controller;

use Symfony\Component\HttpFoundation\Response;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Sensio\Bundle\FrameworkExtraBundle\Configuration\Template;
use Test\MyBundle\Entity\Dummy;

class DefaultController extends Controller
{
    /**
     * @Route("/check/{name}")
     * @Template()
     */
    public function indexAction($name)
    {
        // fill the POPO with the raw request parameters
        $dummy = new Dummy();
        $dummy->login = $this->get('request')->get('login');
        $dummy->email = $this->get('request')->get('email');

        // check the POPO against the rules from validation.yml
        $validator = $this->get('validator');
        $errors = $validator->validate($dummy);

        if (count($errors) > 0) {
            // escape before output: the violations can contain the submitted values
            return new Response("<pre>". htmlspecialchars(print_r($errors, true), ENT_QUOTES, 'UTF-8'). "</pre>");
        } else {
            return new Response("'". htmlspecialchars($name, ENT_QUOTES, 'UTF-8'). "' is valid");
        }
    }
}
//..

Sanitization

As mentioned above, Doctrine comes with sanitization for database (SQL) injections. Apart from this, there is no recommended / provided input sanitization at controller level. However, using Twig in the view, output sanitization is available.

Laravel 3

The following is about Laravel 3.2.12. However, Laravel 4 is around the corner and could change most of it, if not all.

Validation

Laravel follows a controller-centric validation strategy. Validation is done by calling a static method with the given input data and a set of rules validating the inputs. I could not find any out-of-the-box way to use an inverted regular expression (so I used a different rule). It is possible to provide additional custom checks, but there is no trivial way to do this.

Form errors can be accessed from the view, but they are not as natively interwoven with the form as in the other two frameworks. I am still on the fence about which I prefer.

<?php
// ..
class Home_Controller extends Base_Controller {

    public function action_index()
    {
        $input = Input::all();
        $rules = array(
            'login' => 'required',
            'email' => 'required|email|match:/a/',
        );
        $messages = array(
            'required' => 'You missed the :attribute field',
            'email'    => 'This is not an email',
            'match'    => "I don't like this email"
        );
        $validator = Validator::make($input, $rules, $messages);
        if ($validator->passes()) {
            return "All is good";
        } else {
            return "<pre>". print_r($validator->errors, true). "</pre>";
        }
    }
}
//..

Sanitization

There is no official implementation or recommendation for sanitizing user input. At the model layer, the PDO quoting seems to be used.

CakePHP 2

I’ve used the current stable CakePHP version 2.2.3.

Validation

CakePHP uses a model-centric approach for the validation. CakePHP provides its own model implementation, so the validation is unique.

You can extend model validators at runtime, but there is no recommended way to validate data without models.

app/Model/Dummy.php (the model)

<?php
// ..
class Dummy extends AppModel
{

    public $validate = array(
        'login' => array(
            'required' => true,
            'rule'     => 'alphaNumeric',
        ),
        'email' => array(
            'isThere' => array(
                'rule'     => 'notEmpty',
                'required' => true,
                'message'  => 'Email is missing',
            ),
            'checkEmail' => array(
                'rule'     => 'email',
                'required' => true,
                'message'  => 'This is not an email',
            ),
            'checkLikeIt' => array(
                'rule'    => array('checkReverseRegex', '/a.*b/i'),
                'message' => 'I don\'t like this email',
                'required' => true,
            ),
        )
    );

    public function checkReverseRegex($data, $regex)
    {
        return !preg_match($regex, $data['email']);
    }
}
//..

app/Controller/DummyController.php (the controller -> using validation)

<?php
// ..
class DummyController extends AppController
{

    public $uses = array('Dummy');

    public function index()
    {
        $this->autoRender = false;
        // used GET cause I am lazy, for POST: $this->data
        $this->Dummy->set($this->params['url']['data']);
        if ($this->Dummy->validates()) {
            print "All is good";
        } else {
            print '<pre>'. print_r($this->Dummy->validationErrors, true). '</pre>';
        }
    }

}
//..

CakePHP’s validation is tightly interwoven with its form generation in the view layer.

Validation of non-model input is not natively supported (aside from calling the pre-defined validation methods on a per-input basis).

Sanitization

Data sanitization is implemented as a Utility which can be accessed from anywhere (controller, component, model .. even view). It follows a sanitize-all-input approach with a fixed set of predefined sanitization filters. Sanitizing specific inputs with dedicated rules is possible, but does not seem to be encouraged. The existing rules concentrate on SQL and HTML injections and on filtering out generally suspicious Unicode characters.

<?php
// ..
App::uses('Sanitize', 'Utility');

class DummyController extends AppController
{

    public function index()
    {
        $this->autoRender = false;
        // used GET cause I am lazy, for POST: $this->data
        $clean = Sanitize::clean($this->params['url']['data'], array(
            'remove_html' => 1
        ));
        $clean['Dummy']['login'] = Inflector::slug($clean['Dummy']['login']);
        print '<pre>'. print_r($clean, true). '</pre>';
    }
}
//..
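
Outside of CakePHP, the two steps above (strip HTML, then convert the login into a URL-compatible string) can be approximated in plain PHP. This is only a rough sketch with a made-up function name; it does not reproduce what Inflector::slug does in detail (for example, it simply drops non-ASCII letters instead of transliterating them).

```php
<?php
// Hypothetical filter: lowercase, replace every run of other characters with "-"
function slugify(string $input): string
{
    $slug = strtolower(trim($input));
    $slug = (string) preg_replace('/[^a-z0-9]+/', '-', $slug);

    return trim($slug, '-');
}

$login = '<b>John</b> Doe!';
var_dump(slugify(strip_tags($login))); // string(8) "john-doe"
```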

DataFilter - my weekend project

I have been using Slim lately for all kinds of small applications and have already written a controller extension for it. Paranoid as I am, I needed a strong validation and sanitization solution as well. So I sat down and wrote a simple data filter module, which should be easy to integrate into Slim - or wherever I need it.

My focus was on:

  • Simple usage for simple cases.
  • Being able to describe validation rules in a meta language (eg JSON, YAML, ..).
  • High level control (modification) over validation at controller level.
  • Edge cases (eg dependencies between attributes).

Using the above definition, DataFilter fits in the controller-centric view. However, I am sure it could also be integrated to handle validation at the model layer. The syntax is inspired from all over the place: a bit of Symfony (outsourcing definitions), a bit of CakePHP (straightforward rule definitions per attribute) and even a Perl module called Data::FormValidator, which was used heavily back in the day.

I’ve already documented a lot on GitHub, so here is just a short example of the three ways you can employ it (programmatic, loading the validation definition from a JSON file, or inline definition):

Programmatic

<?php
// ..
$profile = new \DataFilter\Profile();

// add global filters
$profile->addPreFilters(['Trim', 'StripHtml']);

// add login attrib
$profile->setAttrib('login')->setRule('default', 'AlphaNum');

// add email attrib + rules + filters
$email = $profile->setAttrib('email');
$email->setRequired(true);
$email->setRule('checkemail', [
    'constraint' => 'Email',
    'error'      => 'This seems not to be an email'
]);
$email->setRule('checkLikeIt', [
    'constraint' => 'Regex:/a.*b/i',
    'error'      => 'I don\'t like this email'
]);
$email->addPreFilters([
    function($in) {
        return 'user+'. $in;
    }
]);

// do the checks
if ($profile->check($_POST)) {
    $data = $profile->getLastResult()->getValidData();
}
else {
    error_log("Failed, required params not given");
}
//..

From JSON

definition.json

{
    "attribs": {
        "login": "AlphaNum",
        "email": {
            "required": true,
            "rules": {
                "checkEmail": {
                    "constraint": "Email",
                    "error": "This seems not to be an email"
                },
                "checkLikeIt": {
                    "constraint": "Regex:/a.*b/i",
                    "error": "I don't like this email"
                }
            },
            "preFilters": [
                ["\\MyFilter", "prefixWithSomething"]
            ]
        }
    },
    "preFilters": [
        "Trim",
        "StripHtml"
    ]
}

MyFilter.php

<?php
// ..
class MyFilter {
    public static function prefixWithSomething($input) {
        return 'user+'. $input;
    }
}
//..

run.php

<?php
// ..
$profile = \DataFilter\Profile::fromJson("definition.json");

// do the checks
if ($profile->check($_POST)) {
    $data = $profile->getLastResult()->getValidData();
}
else {
    error_log("Failed, required params not given");
}
//..

Inline

<?php
// ..
$profile = new \DataFilter\Profile([
    'attribs' => [
        'login' => 'AlphaNum',
        'email' => [
            'required' => true,
            'rules'    => [
                'checkEmail' => [
                    'constraint' => 'Email',
                    'error'      => 'This seems not to be an email'
                ],
                'checkLikeIt' => [
                    'constraint' => 'Regex:/a.*b/i',
                    'error'      => 'I don\'t like this email',
                ],
            ],
            'preFilters' => [
                function($in) {
                    return 'user+'. $in;
                }
            ]
        ]
    ],
    'preFilters' => [
        'Trim',
        'StripHtml'
    ]
]);
if ($profile->check($_POST)) {
    $data = $profile->getLastResult()->getValidData();
}
else {
    error_log("Failed, required params not given");
}
//..

Todo

The module is still in early development.
