Pages

description

Data is a real challenge facing most companies and organisations. This data has to be stored, copied, backed up, archived and deleted. Storing and managing it is half the story though, it has to be cleanesed, analysed and put to good use.Data is the 21st century's oil. My blog's will present my view on it. Please take it as it is, a point of view.

Wednesday, 18 July 2012

Collect and Query Data


This article will cover data collecting and querying methods.
What is an ETL tool ?
Once we've collected the data, how do we query structured, semi-structured and unstructured data ?







How to collect data?

Most of the time, data is all over the place. Structured data will be stored in heterogeneous databases. Unstructured data will be stored in network shares, e-mails or on the internet.
To collect all this data and consolidate it somewhere in order to analyse it requires a specific set of tools called ETL, which stands for Extract-Transform-Load.

Extract: The first phase involves extracting the data from multiple sources using specific connectors. The data can be checked against filters in order to reduce the output.    
Transform: All these heterogeneous data sources use different ways of storing the data. As we need to consolidate all of it in one database, an important bit of transformation is required.
Examples :
-          Transform multiple columns into one. Firstname and lastname becomes firstname.lastname
-          De-duplicating data. If two or more data sources contain the same data.
-          Sorting.
-          Modifying values. The phone numbers from the UK based database will be transformed from 07801451544 to +44 7801 451544.
-          Etc …
Load : The last phase involves loading the data into a destination Data Warehouse (DW). Most of the time, this DW will be a database of some sort. A careful design will be required for the load timing, archiving and deleting strategy, depending on the business needs.
Other variants exist, like ELT (Extract-Load-Transform) tools. But the principles remain the same.

Query the data ?

Once we’ve defined the best data types for our needs, and that we can collect it from all kind of sources, how do we query it?

Structured data
Even though Relational Databases (RDBMS) is a relatively old technology, it is still largely used is pretty much all companies. This market is dominated by Oracle, Microsoft, IBM and some open source products like MySQL and PostgresSQL.
Querying an RDBMS database uses a language called Structured Query Language (SQL). A basic example would look like:
SELECT columns FROM table_name WHERE conditions
Most IT people have already seen or used SQL. I’m not going to describe it any further in this document. Enough documentation is available online if you’re interested. For testing purposes, you can use Microsoft Access, available on most desktops.

Semi-Structured
Many APIs were developed in order to process XML data, as well as XML databases offering the ability to:
  • -          Store data using the XML format.
  • -          Query it using the XPATH language. XPath is based on a tree representation of the XML document.
  • -          Transform the document if required using XLST.

Examples of such databases are BaseX, eXist, SQL Server or Oracle.

Unstructured data

Opposite to what we’ve seen above with structured and semi-structured data, unstructured data offers no data model. Data is there, and it’s up to us to find a way to use it.
Storing unstructured data has always been possible by simply using a file system. But then, there was no easy way to query all this information, which led to the creation of the NoSQL databases. These data stores may not require fixed table schemas, usually does not support join operations and may not give full ACID (atomicity, consistency, isolation, durability) guarantees.
NoSQL databases are categorised dependent on how they store data:
  • -          Big Table Databases
  • -          Documents store
  • -          Key-value stores

Going through all the NoSQL databases would require an article on it’s own. Let’s take a quick look at the key-vales stores, which is one of the most common.
app_setting_width      450user1923_color         Reduser1923_age           18user3371_color         Blueuser4344_color         Brackishuser1923_height        6' 0"user3371_age           34error_msg_457          There is no file %1 hereerror_message_1        There is no user with %1 name1923_name              Jimuser1923_name          Jim Smithuser1923_lname         SmithApplication_Installed  truelog_errors             1install_path           C:\Windows\System32\RestrictedServerName             localhost

As you can see above, the keys and the values are completely free, and can be used in any way the programmer or the DBA decides.
Data is queried using APIs. Basically from an attribute name the system will get the associated value. Some APIs allow searching the values as well.

Note;
NoSQL databases are quite new, and the market is still in its infancy, hence the great number of different products and implementations
,
both commercial and open source.

No comments:

Post a Comment