This article will cover data collecting and querying methods.
What is an ETL tool ?
Once we've collected the data, how do we query structured, semi-structured and unstructured data ?
How to collect data?
What is an ETL tool ?
Once we've collected the data, how do we query structured, semi-structured and unstructured data ?
How to collect data?
Most of the time, data is all
over the place. Structured data will be stored in heterogeneous databases.
Unstructured data will be stored in network shares, e-mails or on the internet.
To collect all this data and
consolidate it somewhere in order to analyse it requires a specific set of
tools called ETL, which stands for Extract-Transform-Load.
Extract: The first phase involves extracting the data
from multiple sources using specific connectors. The data can be checked
against filters in order to reduce the output.
Transform: All these heterogeneous data sources use
different ways of storing the data. As we need to consolidate all of it in one
database, an important bit of transformation is required.
Examples :
- Transform multiple columns into one. Firstname and lastname becomes firstname.lastname
- De-duplicating data. If two or more data sources contain the same data.
- Sorting.
- Modifying values. The phone numbers from the UK based database will be transformed from 07801451544 to +44 7801 451544.
- Etc …
Load : The last phase involves loading the data into
a destination Data Warehouse (DW). Most of the time, this DW will be a database
of some sort. A careful design will be required for the load timing, archiving
and deleting strategy, depending on the business needs.
Other variants exist, like ELT (Extract-Load-Transform) tools.
But the principles remain the same.
Query the
data ?
Once we’ve defined the best data
types for our needs, and that we can collect it from all kind of sources, how
do we query it?
Structured data
Even though Relational Databases (RDBMS) is a relatively old
technology, it is still largely used is pretty much all companies. This market
is dominated by Oracle, Microsoft, IBM and some open source products like MySQL
and PostgresSQL.
Querying an RDBMS database uses a language called Structured
Query Language (SQL). A basic example would look like:
SELECT columns FROM table_name WHERE conditions
Most IT people have already seen
or used SQL. I’m not going to describe it any further in this document. Enough
documentation is available online if you’re interested. For testing purposes,
you can use Microsoft Access, available on most desktops.
Semi-Structured
Many APIs were developed in
order to process XML data, as well as XML databases offering the ability to:
- - Store data using the XML format.
- - Query it using the XPATH language. XPath is based on a tree representation of the XML document.
- - Transform the document if required using XLST.
Examples of such databases
are BaseX, eXist, SQL Server or Oracle.
Opposite to what we’ve seen above with
structured and semi-structured data, unstructured data offers no data model. Data
is there, and it’s up to us to find a way to use it.
Storing unstructured data has always
been possible by simply using a file system. But then, there was no easy way to
query all this information, which led to the creation of the NoSQL databases. These data stores may not require fixed table schemas, usually does not support join operations and may not give full ACID (atomicity, consistency, isolation,
durability) guarantees.
NoSQL databases are categorised dependent
on how they store data:
- - Big Table Databases
- - Documents store
- - Key-value stores
Going through all the
NoSQL databases would require an article on it’s own. Let’s take a quick look
at the key-vales stores, which is one of the most common.
app_setting_width 450user1923_color Reduser1923_age 18user3371_color Blueuser4344_color Brackishuser1923_height 6' 0"user3371_age 34error_msg_457 There is no file %1 hereerror_message_1 There is no user with %1 name1923_name Jimuser1923_name Jim Smithuser1923_lname SmithApplication_Installed truelog_errors 1install_path C:\Windows\System32\RestrictedServerName localhost
As
you can see above, the keys and the values are completely free, and can be used
in any way the programmer or the DBA decides.
Data
is queried using APIs. Basically from an attribute name the system will get the
associated value. Some APIs allow searching the values as well.
Note;
NoSQL
databases are quite new, and the market is still in its infancy, hence the
great number of different products and implementations
,
both commercial and open source.
,
both commercial and open source.
No comments:
Post a Comment