Data Reign: Collect and Query Data

This article will cover data collecting and querying methods.
What is an ETL tool ?
Once we've collected the data, how do we query structured, semi-structured and unstructured data ?

How to collect data?

Most of the time, data is all over the place. Structured data will be stored in heterogeneous databases. Unstructured data will be stored in network shares, e-mails or on the internet.

To collect all this data and consolidate it somewhere in order to analyse it requires a specific set of tools called ETL, which stands for Extract-Transform-Load.

Extract: The first phase involves extracting the data from multiple sources using specific connectors. The data can be checked against filters in order to reduce the output.

Transform: All these heterogeneous data sources use different ways of storing the data. As we need to consolidate all of it in one database, an important bit of transformation is required.

Examples :
-          Transform multiple columns into one. Firstname and lastname becomes firstname.lastname
-          De-duplicating data. If two or more data sources contain the same data.
-          Sorting.
-          Modifying values. The phone numbers from the UK based database will be transformed from 07801451544 to +44 7801 451544.
-          Etc …

Load : The last phase involves loading the data into a destination Data Warehouse (DW). Most of the time, this DW will be a database of some sort. A careful design will be required for the load timing, archiving and deleting strategy, depending on the business needs.

Other variants exist, like ELT (Extract-Load-Transform) tools. But the principles remain the same.

Query the data ?

Once we’ve defined the best data types for our needs, and that we can collect it from all kind of sources, how do we query it?

Structured data

Even though Relational Databases (RDBMS) is a relatively old technology, it is still largely used is pretty much all companies. This market is dominated by Oracle, Microsoft, IBM and some open source products like MySQL and PostgresSQL.

Querying an RDBMS database uses a language called Structured Query Language (SQL). A basic example would look like:

SELECT columns FROM table_name WHERE conditions

Most IT people have already seen or used SQL. I’m not going to describe it any further in this document. Enough documentation is available online if you’re interested. For testing purposes, you can use Microsoft Access, available on most desktops.

Semi-Structured

Many APIs were developed in order to process XML data, as well as XML databases offering the ability to:

- Store data using the XML format.
- Query it using the XPATH language. XPath is based on a tree representation of the XML document.
- Transform the document if required using XLST.

Examples of such databases are BaseX, eXist, SQL Server or Oracle.

Unstructured data

Opposite to what we’ve seen above with structured and semi-structured data, unstructured data offers no data model. Data is there, and it’s up to us to find a way to use it.

Storing unstructured data has always been possible by simply using a file system. But then, there was no easy way to query all this information, which led to the creation of the NoSQL databases. These data stores may not require fixed table schemas, usually does not support join operations and may not give full ACID (atomicity, consistency, isolation, durability) guarantees.

NoSQL databases are categorised dependent on how they store data:

- Big Table Databases
- Documents store
- Key-value stores

Going through all the NoSQL databases would require an article on it’s own. Let’s take a quick look at the key-vales stores, which is one of the most common.

app_setting_width 450user1923_color Reduser1923_age 18user3371_color Blueuser4344_color Brackishuser1923_height 6' 0"user3371_age 34error_msg_457 There is no file %1 hereerror_message_1 There is no user with %1 name1923_name Jimuser1923_name Jim Smithuser1923_lname SmithApplication_Installed truelog_errors 1install_path C:\Windows\System32\RestrictedServerName localhost

As you can see above, the keys and the values are completely free, and can be used in any way the programmer or the DBA decides.

Data is queried using APIs. Basically from an attribute name the system will get the associated value. Some APIs allow searching the values as well.

Note;

NoSQL databases are quite new, and the market is still in its infancy, hence the great number of different products and implementations
,
both commercial and open source.

Data Reign

Pages

description

Wednesday, 18 July 2012

Collect and Query Data

No comments:

Post a Comment