“I like Noh theater and I like NoSql”
“So you don’t like theater and you don’t like databases?”
“No, I love Noh theater, and I love NoSql”
“But you just said…..”
— Noh Theater is a type of Japanese Musical Drama

Normalize vs Denormalize

In college, when dealing with databases, I was taught Normalize, Normalize, Normalize. Use those foreign keys, don’t store redundant data, keep is small keep it simple, ‘Sir, the private’s weapon’s name is Charlene’ 🙂

One of my jobs involved getting 10-15 million records of new data a night. Another one involved significantly complex queries for most page loads because a 230 table db was highly normalized. This volume of records (which is really not much compared to most large scale public web sites) along with the frustration of constant query optimization has had me really thinking.

Why do we normalize so? Is it for data integrity? If so, who’s mucking around in the DB doing inserts and updates and operating outside of the business rule? A lookup table is great but is it worth the referential integrity hit in the long run?

Let look at a place where a foreign key appears to be a good idea at first:
Lets say you have a db with 500 tables. On each table you keep a record of who created the record and when (entry_creator, entry_created_ts). When you first design it you think to yourself, “Let’s make the entry_creator a foreign key on the employee table”. Sounds great. But what happens when you want to report on 20 of those tables and show the entry creator name? If you are sticking to strict ANSI SQL you’ll end up joining the employee table 20 times (there are other solutions such as in PL/SQL where you could load the table once and reference it, but let’s try to stay db agnostic).

So after a while you think, “Screw this, if I’m going to be reporting this type of stuff all the time, I’m going to put the employee name right in the table and just have the foreign key be the user name”. Your reports SQL gets much simpler and your SQL should execute faster. But what happens when the employee changes their name (say they get married or divorced)?. You now have to update every table where that name exists. When you do the update you could lock those tables for a bit (especially with a big table). Not the best solution.

But you might not be able to do even that. What if you have physical paper records with the persons old name on them. That record shouldn’t be updated. In that case you need to make a new employee and deactivate the existing one. If you were really thinking you might have added a field on your employee table named something like ‘old_employee_id’ so you at least had some type of history.

So where does that leave us? Well, it depends on the situation. In some ways the normalization (using the employee_id) allows you to easily change the name, but you would lose the history and your SQL would be slower and more complex. If you denormalize though (and consider renames a new employee) you get faster SQL and history of user names. To me, denormalization in this case is completely acceptable.

Referential Integrity

Even if we denomalize like above most people would still have referential integrity enforced on the field. It makes sense. You don’t want someone entering a name of an employee in a table and not have that employee exist. But wait, you have 500 tables all with referential integrity to the same table. That’s a lot of checks against the same table. What’s that going to do to performance? Every inserted (and perhaps updated depending on the smarts of the db) is going to do a check against the employee table. That can add up.

At my last job we tinkered with removing referential integrity for certain key tables (like employee) because of performance issues. I never got to try it and get a performance data but I’m betting you, under heavy load, you might see a difference.

My thoughts on referential integrity: Use it whenever you can but if you can get away with not having it, don’t.

Whats the point of the DB then?

This brings me to my main point. If you are going to go without referential integrity, why use an RDBMS at all? Most enterprise systems hopefully don’t have developers or sys-admins mucking around in the DB at a record CUD (create, update, delete) level. There are tools to help the admin so specific business rules are enforced. If you have the tools to make sure the data is OK, then just drop the referential integrity. The performance gain alone is worth it.

But then, what does the RDBMS really get you? Exactly! Not much (OK, I’m downplaying it here, but really, if you are reporting against a production system in real time? Shame on you. Warehouse it and hit that system) . Maybe it’s time to go with something else. Welcome to NoSQL.

NoSQL is (as defined by Wikipedia here) a data storage structure that is not relational in nature. The best way to think of it is a Hashtable of Hashtables of Hashtables. The first Hashtable is a key/value pair where the key is the name of your table and the value is Hashtables where each element is a key/value pair where the key is the primary key and the value is a Hashtable of column/data . I’m oversimplifying but it’s a good mental picture.

With a structure like this you are looking at a, what, O(3) performance ( O(1) per Hashtable )? That’s fast. With the correct denormalization you could really do some quick mining. Now, this doesn’t take into consideration the loss of SQL for joins, etc. But if the data is laid out in the correct data nuggets you won’t have to worry much. And most NoSQL solutions (such as SimpleDB) have a basic idea of multiple record retrieval.

Welcome to the Cloud

This is where Cloud computing comes in. And to me, it’s just lovely up there in the clouds (it’s all I could think of. Covered in extra cheese). Amazon S3 and Amazon SimpleDB are great solutions for systems that have identified that they could do without an RDBMS. Think of what that get you:

  • The servers are somewhere else. You don’t have to worry about ping and power for them
  • Their servers are have all the fail-over architecture already in place. You know what that architecture costs to build? Save your money!
  • Fast. As fast as your data pipe will allow you basically.

I just think its brilliant. Your company saves money because you don’t have to buy any hardware, you don’t have to hire that sys-admin and hand him a beeper, you don’t have to worry about scaling the hardware when the drives go full, you don’t have to worry about backups, and the list goes on.

Honestly, my next project, if I need a db that needs high availability and I don’t have to worry about FDA regulations, I’m going with SimpleDB and S3.

About sseaman

Connect with me on Google+
This entry was posted in Database and tagged , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published.