Everybody wants their queries
to run instantaneously. Business people don’t care much about the related technical
issues. They have business needs, and it’s up to the IT people to take care
of the how. This article will cover a few infrastructure options.
Scaling Vertically
Disks being what they are (slow), making queries run faster usually comes down to reducing the I/O footprint, either by tuning the queries or by allocating enough memory for in-memory execution. Sometimes the issue is the number of concurrent users or sessions, in which case adding CPUs will do the trick.
However, scaling vertically has its limits: the number of CPUs and the amount of memory a single server can accommodate is finite, and highly scalable servers are quite expensive as well.
Scaling Horizontally: Shared Everything database
The Shared Everything approach expects the cluster nodes (servers) to share disk and memory. The tables and objects are visible from all nodes of the cluster.
Graphic source: Wikipedia
Since there is no single ‘master node’ (which would create a bottleneck), the database designates a master for each block. When an instance needs a block, it asks the other members of the cluster for its whereabouts and value, and must request ownership before being able to read or use it. Continuous, fast communication between the nodes is therefore necessary to avoid data corruption.
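The per-block mastering scheme can be sketched as a toy model. The class and method names below are illustrative only, not any vendor's API; real implementations are far more involved:

```python
# Toy model of per-block mastering in a shared-everything cluster.
# All names are invented for illustration.

class ClusterCache:
    def __init__(self, node_ids):
        self.node_ids = node_ids
        self.owner = {}  # block_id -> node currently holding the block

    def master_of(self, block_id):
        # Each block is mastered by one node (here chosen by hashing),
        # so no single node becomes a bottleneck for all blocks.
        return self.node_ids[hash(block_id) % len(self.node_ids)]

    def acquire(self, node, block_id):
        # A node asks the block's master where the block lives and
        # requests ownership before reading or using it.
        master = self.master_of(block_id)
        previous_owner = self.owner.get(block_id)
        self.owner[block_id] = node
        return master, previous_owner

cache = ClusterCache(["node1", "node2", "node3"])
master, prev = cache.acquire("node2", block_id=42)
```

Every `acquire` call implies inter-node messages, which is why the interconnect becomes the limiting factor as the cluster grows.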
Oracle, DB2 and Sybase all offer products with shared-everything architectures.
Even though it’s a fully resilient and really powerful architecture, scaling a shared-everything RDBMS horizontally has its limits. The more nodes we add to a cluster, the more inter-node communication is required. Even on Exadata, with its 40 Gb/s InfiniBand interconnect, coordinating ten nodes would be a feat for some applications.
Scaling Horizontally: Shared Nothing database
The second approach to scaling horizontally is to share nothing at all. Each member of the cluster has its own resources and its own slice of the data. Even though the database looks like a single whole from the user’s point of view, physically the data is split among multiple servers.
Graphic source: Wikipedia
The coordinator knows where each portion of the data resides on the cluster. The partitioning can be temporal (e.g. by day, month or year) or by category (e.g. store x, store y, store z).
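The coordinator’s routing can be pictured as a simple lookup table. This is a minimal sketch with invented node names and partition keys, not any product’s actual catalog:

```python
# Minimal sketch of a shared-nothing coordinator's routing table.
# Category partitioning: each store's rows live on exactly one node.
# Node names and partition keys are made up for illustration.

partition_map = {
    "store_x": "node1",
    "store_y": "node2",
    "store_z": "node3",
}

def route(store):
    # The coordinator knows which node holds each data slice,
    # so a query filtered on one store touches only that node.
    return partition_map[store]
```

A temporal scheme works the same way, with date ranges instead of store names as the partition keys.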
A query will then use all the cluster nodes, both to access the data and to process the output. There is virtually no limit to the number of nodes in such a configuration.
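The execution pattern above is often called scatter-gather: each node aggregates its own slice, and the coordinator merges the partial results. A minimal sketch, with invented node names and data:

```python
# Scatter-gather sketch: each shared-nothing node scans only its
# local rows, and the coordinator combines the partial aggregates.
# Node names and row values are invented for illustration.

node_data = {
    "node1": [10, 20],
    "node2": [30],
    "node3": [40, 50, 60],
}

def cluster_sum():
    # Scatter: each node computes a partial sum over its local data.
    partials = [sum(rows) for rows in node_data.values()]
    # Gather: the coordinator adds up the partials.
    return sum(partials)

print(cluster_sum())  # 210
```

Because each node only touches its own data, adding nodes adds both storage and scan capacity, which is why this design scales so far.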
This architecture is used by appliances like Netezza, EMC’s Greenplum and Teradata, and is targeted at data warehouse workloads.
Which one is better?
Well, my favourite answer is ‘it depends’.
It depends on your application, the type of data, the scalability you’re looking for, the data protection you need, the speed of the platform, what you want to do, how much budget you’ve got, and so on. All the architectures presented above have their pros and cons, and each will be useful in some cases but not in others.