Notes on Database Design: Types for Data Validation

reading time ( words)

PostgreSQL provides TYPE and DOMAIN abstraction which impose structural and semantic constraint over your data. Wisely using these mechanisms provides solid guarantees on data quality, enforceable at the database itself. Let’s continue the Notes on Database Design series with the use of DOMAIN to validate your data.

As an example, I’ll provide an implementation of an email address type. For simplicity, let’s define an email address as the tuple of a local part and a fully-qualified domain name concatenated with the @ symbol. I’ll also refer to those parts as the right hand side and left hand side, or RHS and LHS for short.

Email addresses can be way more complex than you think. If you don’t believe me, take a look at RFC-5321 § 4.1.2 for the gory details. Also keep in mind that a syntactically correct email address can still be invalid. The only way to know for sure, is to try and send an email to it.

I’ll use a hierarchy of DOMAIN definitions to implement a fully validating type to efficiently store email addresses.

The FQDN representation

The right hand side of an email address is generally a fully-qualified domain name. Let’s build our first abstraction, to encode and verify valid domain names. To start, we need a validation function that returns TRUE when a given string can be considered a valid fqdn.

1
2
3
4
5
CREATE OR REPLACE FUNCTION valid_fqdn(d TEXT)
RETURNS BOOLEAN AS $$
  SELECT $1 ~ '^[a-z0-9][-a-z0-9]{0,62}(\.[a-z0-9][-a-z0-9]{0,62})+$'
         AND LENGTH($1) < 255;
$$ LANGUAGE SQL STRICT IMMUTABLE SECURITY DEFINER;

In this case, I’m implementing the essence of RFC-1035 § 2.3.4: Labels of 1 to 63 acceptable characters separated by a single dot, with a maximum length of 255 characters. This is easily accomplished by the regular expression and the length check highlighted.

Notice how this is just a syntax validation. Attempts to discern unresolvable domains or enforcing additional rules are destined to fail, so don’t do it.

With valid_fqdn defined, we can go ahead and declare our DOMAIN.

7
CREATE DOMAIN fqdn AS TEXT CHECK ( valid_fqdn(value) );

Now you can easily check whether strings are valid domain names or not, automatically.

Every time you refer to something of type fqdn, PostgreSQL will make sure it’s indeed a valid domain name, sparing you to add check code. If you declare a column of type fqdn, the built-in validation provided by the DOMAIN will make sure that only fully-qualified domain names that satisfy the restriction are added.

This is just a very simple DOMAIN definition. PostgreSQL provides support for more powerful definitions. Take a look at the CREATE DOMAIN and CREATE TYPE sections in the PostgreSQL documentation to learn more.

Dealing with the left hand side

To illustrate composition, I’ll define two additional DOMAIN instances:

 9
10
11
12
13
14
15
16
17
CREATE OR REPLACE FUNCTION valid_local(d TEXT)
RETURNS BOOLEAN AS $$
  SELECT $1 ~*
  '^[a-z0-9!#$%&''*+-/=?^_`{|}~]+'
  '(\.[a-z0-9!#$%&''*+-/=?^_`{|}~]+)*$'
   AND LENGTH($1) < 64;
$$ LANGUAGE SQL STRICT IMMUTABLE SECURITY DEFINER;

CREATE DOMAIN local_part AS TEXT CHECK ( valid_local(value) );

The regular expression on lines 12 and 13 actually comes from RFC-2321, while the length constraint at line 14 comes from RFC-5321 § 4.5.3.1.1. And yes, all those characters can actually be part of an email address, so our schema better be prepared to handle them all.

lem=# SELECT '>'::local_part;
ERROR:  value for domain local_part violates check constraint "local_part_check"
lem=# SELECT 'really+valid/always.fun!'::local_part;
local_part | really+valid/always.fun!

As you can see in the examples above, the validation works as expected, rejecting any string that can’t really be the local part or left hand side of an email address.

Putting it all together

I’ll now define a DOMAIN that will combine our local_part and fqdn into a mailbox, a fully syntactically valid email address. Notice the explicit casts which in turn, cause the validation of that part of the email address.

19
20
21
22
23
24
25
26
27
CREATE OR REPLACE FUNCTION valid_mailbox(d TEXT)
RETURNS BOOLEAN AS $$
  SELECT $1 ~ '^[^@]+@[^@]+$'
    AND split_part($1, '@', 1)::local_part != ''
    AND split_part($1, '@', 2)::fqdn != ''
    AND LENGTH($1) < 64 + 1 + 255;
$$ LANGUAGE SQL STRICT IMMUTABLE SECURITY DEFINER;

CREATE DOMAIN mailbox AS TEXT CHECK ( valid_mailbox(value) );

And as usual, here are some simple examples of the automated validation in action.

lem=# SELECT 'fun@postgres.help'::mailbox;
mailbox | fun@postgres.help
lem=# SELECT '<fly@quad.click'::mailbox;
ERROR:  value for domain local_part violates check constraint "local_part_check"
CONTEXT:  SQL function "valid_mailbox" statement 1
lem=# SELECT 'nosuch@domain'::mailbox;
ERROR:  value for domain fqdn violates check constraint "fqdn_check"
CONTEXT:  SQL function "valid_mailbox" statement 1

Notice the errors signaling which part of the email address is broken. These errors will automatically abort transactions where invalid data is being fed to your database, which is a very powerful integrity control.

When using database-driven integrity constraints such as in these examples, your application code no longer needs code to provide these constraints. They are enforced automatically at the database level. Better yet, if you have three or four different client applications, you still have a single point of validation that works seamlessly — and consistently! — across all applications.

Using our new fqdn domain

To illustrate the benefits of our new fqdn DOMAIN, let’s define a new customers table.

1
2
3
4
5
CREATE TABLE customers (
  id    SERIAL PRIMARY KEY,
  name  TEXT,
  email mailbox NOT NULL
) INHERITS ( audit );

Now look what happens when we try to add a customer with an invalid email address.

lem=# INSERT INTO customers ( name, email )
VALUES
( 'Jane Doe', 'badmail>@some.domain');
ERROR:  value for domain local_part violates check constraint "local_part_check"
CONTEXT:  SQL function "valid_mailbox" statement 1

And since this is implemented at the database level, your data integrity is not dependent on the client code having the right check in place: Bad data simply cannot enter the database.

In these simple examples, our domain definitions are basically TEXT with validation functions automatically attached to them. This means that it’s still possible to compare say, a mailbox and a fqdn. Perhaps this will be available at some point in the future of the DOMAIN construct.