Everybody wants their queries
to run instantaneously. Business people don’t care much about the related technical
issues. They have business needs, and it’s up to the IT people to take care
of the how. This article will cover a few infrastructure options.
Scaling Vertically
Disks being what they are (slow), making queries run faster usually comes down to reducing the I/O footprint, either by tuning the queries or by allocating enough memory for in-memory execution. Sometimes the issue is the number of concurrent users or sessions, in which case adding CPUs will do the trick.
However, scaling vertically has its limits: the number of CPUs and the amount of memory a single server can accommodate is finite, and highly scalable servers are quite expensive as well.
Scaling Horizontally: Shared Everything database
The Shared Everything approach expects the cluster nodes (servers) to share disk and memory. The tables and objects are visible from all nodes of the cluster.
Graphic source: Wikipedia
Since there is no single ‘master node’ (which would create a bottleneck), the database designates a master for each block. When an instance needs a block, it asks the other members of the cluster for its whereabouts and value, and must request ownership before being able to read or use it. Continuous, fast communication between the nodes is therefore necessary to avoid data corruption.
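The per-block mastering scheme can be sketched as a toy model. The class and method names below are illustrative only, not any vendor's API; real implementations are far more involved:

```python
# Toy model of per-block mastering in a shared-everything cluster.
# All names are invented for illustration.

class ClusterCache:
    def __init__(self, node_ids):
        self.node_ids = node_ids
        self.owner = {}  # block_id -> node currently holding the block

    def master_of(self, block_id):
        # Each block is mastered by one node (here chosen by hashing),
        # so no single node becomes a bottleneck for all blocks.
        return self.node_ids[hash(block_id) % len(self.node_ids)]

    def acquire(self, node, block_id):
        # A node asks the block's master where the block lives and
        # requests ownership before reading or using it.
        master = self.master_of(block_id)
        previous_owner = self.owner.get(block_id)
        self.owner[block_id] = node
        return master, previous_owner

cache = ClusterCache(["node1", "node2", "node3"])
master, prev = cache.acquire("node2", block_id=42)
```

Every `acquire` call implies inter-node messages, which is why the interconnect becomes the limiting factor as the cluster grows.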
Oracle, DB2 and Sybase all offer products with shared-everything architectures.
Even though it’s a fully resilient and really powerful architecture, scaling a shared-everything RDBMS horizontally has its limits. The more nodes we add to a cluster, the more inter-node communication is required. Even on Exadata, with its 40 Gb/s InfiniBand interconnect, coordinating ten nodes would be a feat for some applications.
Scaling Horizontally: Shared Nothing database
The second approach to scaling horizontally is to share nothing at all. Each member of the cluster has its own resources and its own slice of the data. Even though the database looks like a single whole from the user’s point of view, physically the data is split among multiple servers.
Graphic source: Wikipedia
The coordinator knows where each portion of the data resides on the cluster. The partitioning can be temporal (e.g. by day, month or year) or by category (e.g. store x, store y, store z).
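The coordinator’s routing can be pictured as a simple lookup table. This is a minimal sketch with invented node names and partition keys, not any product’s actual catalog:

```python
# Minimal sketch of a shared-nothing coordinator's routing table.
# Category partitioning: each store's rows live on exactly one node.
# Node names and partition keys are made up for illustration.

partition_map = {
    "store_x": "node1",
    "store_y": "node2",
    "store_z": "node3",
}

def route(store):
    # The coordinator knows which node holds each data slice,
    # so a query filtered on one store touches only that node.
    return partition_map[store]
```

A temporal scheme works the same way, with date ranges instead of store names as the partition keys.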
A query will then use all the cluster nodes, both to access the data and to process the output. There is virtually no limit to the number of nodes in such a configuration.
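The execution pattern above is often called scatter-gather: each node aggregates its own slice, and the coordinator merges the partial results. A minimal sketch, with invented node names and data:

```python
# Scatter-gather sketch: each shared-nothing node scans only its
# local rows, and the coordinator combines the partial aggregates.
# Node names and row values are invented for illustration.

node_data = {
    "node1": [10, 20],
    "node2": [30],
    "node3": [40, 50, 60],
}

def cluster_sum():
    # Scatter: each node computes a partial sum over its local data.
    partials = [sum(rows) for rows in node_data.values()]
    # Gather: the coordinator adds up the partials.
    return sum(partials)

print(cluster_sum())  # 210
```

Because each node only touches its own data, adding nodes adds both storage and scan capacity, which is why this design scales so far.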
This architecture is used by appliances like Netezza, EMC’s Greenplum and Teradata, and is targeted at data warehouse workloads.
Which one is better?
Well, my favourite answer is ‘it depends’.
It depends on your application, the type of data, the scalability you’re looking for, the data protection you need, the speed of the platform, what you want to do, how much budget you’ve got, and so on. All the architectures presented above have their pros and cons, and each will be useful in some cases but not in others.