MySQL Database Analytics with InfiniDB from Calpont - Part 2

Article posted on: Friday, January 15, 2010

In Part 1 of this article, I took you through some of the reasons why a column-oriented database based on MySQL can be compelling. We also examined the architecture of InfiniDB, which is an open source analytic, reporting, and read-intensive database, and how its modular design helps you scale both up and out. Now let's kick the tires of the database so you can see these things in action.

Loading Data

First, let's load some data into a basic database to work with. The load and query tests below were run on a small 32-bit Fedora Core 6 machine with 2GB of RAM and 4 CPUs (2.40GHz, with 512KB cache).

With InfiniDB, you have a high-speed loader to work with called cpimport that does a decent job of taking flat files and pushing their data into database tables. There are two primary steps to using the cpimport utility:

  1. Create a job file that is used to load data from one or more flat files into a database
  2. Run the cpimport utility to perform the data import

Note that bulk loads are an append operation on a table, so existing data can still be read and remains unaffected during the process. When the load completes, a high-water mark in each column file is moved in a single atomic operation, after which any subsequent queries can read the newly loaded data.

You can create a load job to load one table at a time or load multiple tables from different flat files. Let's go ahead and load up an entire TPC-H database:

mysql> show tables;
+-----------------+
| Tables_in_tpch2 |
+-----------------+
| customer        |
| lineitem        |
| nation          |
| orders          |
| part            |
| partsupp        |
| region          |
| supplier        |
+-----------------+
8 rows in set (0.00 sec)
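
For reference, InfiniDB tables are created through the regular MySQL interface by specifying the InfiniDB storage engine, and no indexes are defined since InfiniDB does not use them. Below is a minimal sketch using the TPC-H region table; the DDL is illustrative rather than the exact script used for this test:

-- A minimal sketch: an InfiniDB table is declared like any other MySQL
-- table, with the storage engine set to InfiniDB and no indexes.
CREATE TABLE region (
  r_regionkey INTEGER,
  r_name      CHAR(25),
  r_comment   VARCHAR(152)
) ENGINE=InfiniDB;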

First, you put all your load files in InfiniDB's import directory:

[rschumacher@srvsn import]$ pwd
/usr/local/Calpont/data/bulk/data/import
[rschumacher@srvsn import]$ ls -l
total 1076072
-rw-rw-r-- 1 root calpont 24346144 Sep 25 11:39 customer.tbl
-rw-rw-r-- 1 root calpont 759863287 Sep 25 11:39 lineitem.tbl
-rw-rw-r-- 1 root calpont 2224 Sep 25 11:39 nation.tbl
-rw-rw-r-- 1 root calpont 171952161 Sep 25 11:39 orders.tbl
-rw-rw-r-- 1 root calpont 118984616 Sep 25 11:39 partsupp.tbl
-rw-rw-r-- 1 root calpont 24207240 Sep 25 11:39 part.tbl
-rw-rw-r-- 1 root calpont 389 Sep 25 11:39 region.tbl
-rw-rw-r-- 1 root calpont 1409184 Sep 25 11:39 supplier.tbl

Next, you create a job file that is used by the import operation. The job file is created with the colxml utility. The reason for this extra step is that loads are often repeated (especially in dev/test environments), and a job file can be reused many times to reload data into the same objects. So, let's create a job file that imports all our data into a database called 'tpch2', uses the default pipe ('|') character as the field delimiter in the flat files, and is assigned a job number of 500:

[rschumacher@srvsn bin]$ ./colxml tpch2 -j500
Running colxml with the following parameters:
2009-10-07 15:14:20 (9481) INFO :
Schema: tpch2
Tables:
Load Files:
-b 0
-c 1048576
-d |
-e 10
-f CSV
-j 500
-m 50
-n
-p /usr/local/Calpont/data/bulk/job/
-r 5
-s
-u
-w 10485760
-x tbl

File completed for tables:
tpch2.customer
tpch2.lineitem
tpch2.nation
tpch2.orders
tpch2.part
tpch2.partsupp
tpch2.region
tpch2.supplier

Normal exit.

Now, let's perform the actual import for all eight tables using the cpimport utility:

[rschumacher@srvsnp bin]$ ./cpimport -j 500

Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_500.xml
2009-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_500.xml
2009-10-07 15:14:59 (9952) INFO : PreProcessing check starts
2009-10-07 15:15:04 (9952) INFO : PreProcessing check completed
2009-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
2009-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
2009-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
2009-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
2009-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
2009-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
2009-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
2009-10-07 15:16:45 (9952) INFO : Bulk load completed, total run time : 106 seconds
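
As a quick sanity check on the load rate, summing the per-table row counts from the log above gives:

150,000 + 25 + 6,001,215 + 1,500,000 + 200,000 + 800,000 + 5 + 10,000 = 8,661,245 rows
8,661,245 rows / 106 seconds ≈ 81,700 rows per second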

The cpimport utility pushed 8.6 million rows into our database in a little over 100 seconds, which isn't bad for a small 32-bit machine (about 82,000 rows/second). Now, let's run some query tests.

Querying Data in InfiniDB

To run queries against InfiniDB, you'll be using a mysql client interface named calpontmysql.
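
Assuming calpontmysql is a thin wrapper around the standard mysql command-line client (and so accepts the same arguments), connecting to the test schema looks something like the following; the path and invocation here are illustrative:

[rschumacher@srvsn bin]$ ./calpontmysql tpch2
mysql>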

As nearly everyone likes to test query speed by issuing a count(*) against a table, let me start off by saying that InfiniDB does not keep a count of a table's rows immediately on hand, so it will always do an actual count of data:

mysql> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
1 row in set (0.81 sec)

With that caveat out of the way, let's now run a few queries and see what we see. Again, the database has no indexes or manual partitioning of data, so it's just pure column-oriented tables we're working with. Let's start with a 3-way table join between customer, orders, and lineitem:

mysql> select c_name, sum(l_quantity)
-> from customer, orders, lineitem
-> where c_custkey = o_custkey and
-> l_orderkey = o_orderkey and
-> l_shipdate = '1992-01-02'
-> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |           35.00 |
| Customer#000146399 |           21.00 |
| Customer#000104149 |           20.00 |
| Customer#000081157 |           12.00 |
| Customer#000071485 |           27.00 |
...
17 rows in set (1.12 sec)

InfiniDB brings back the result set in a little more than a second, which isn't bad. Now, let's switch to another TPC-H database where the lineitem table holds 20 times the data (120 million rows instead of just 6 million) and run the same query:

mysql> select c_name, sum(l_quantity)
-> from customer, orders, lineitem
-> where c_custkey = o_custkey and
-> l_orderkey = o_orderkey and
-> l_shipdate = '1992-01-02'
-> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |          700.00 |
| Customer#000009263 |          380.00 |
| Customer#000146399 |          420.00 |
| Customer#000072862 |          100.00 |
...
17 rows in set (10.48 sec)

If we added CPUs to the box or increased the data cache size, we could reduce the response time further. Alternatively, if we used the Enterprise Edition option and went MPP, we could expect to cut the response time roughly in half by adding one node, and in half again by adding two more nodes after that.
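
Taking the 10.48-second run above as the single-node baseline, that expectation works out roughly as follows (assuming near-linear scaling, which in practice varies with data distribution and network overhead):

1 node:  ~10.5 seconds
2 nodes:  ~5.2 seconds
4 nodes:  ~2.6 seconds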

Now, let's try a full 8-way join with the smaller TPC-H database:

mysql> select c_name, p_name, ps_availqty, s_name,
-> o_custkey, r_name, n_name, sum(l_quantity)
-> from orders, lineitem, customer, part,
-> partsupp, supplier, nation, region
-> where o_orderkey = l_orderkey and
-> c_custkey = o_custkey and
-> p_partkey = l_partkey and
-> ps_partkey = p_partkey and
-> s_suppkey = ps_suppkey and
-> r_regionkey = n_regionkey and
-> s_nationkey = n_nationkey and
-> o_orderkey = 4161537
-> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| c_name | p_name | ps_availqty | s_name | o_custkey | r_name | n_name | sum(l_quantity) |
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell | 8343 | Supplier#000000597 | 122437 | AMERICA | UNITED STATES | 47.00 |
.
.
28 rows in set (5.68 sec)

And with the bigger TPC-H lineitem table (120 million rows):

mysql> select c_name, p_name, ps_availqty, s_name,
-> o_custkey, r_name, n_name, sum(l_quantity)
-> from orders, lineitem, customer, part,
-> partsupp, supplier, nation, region
-> where o_orderkey = l_orderkey and
-> c_custkey = o_custkey and
-> p_partkey = l_partkey and
-> ps_partkey = p_partkey and
-> s_suppkey = ps_suppkey and
-> r_regionkey = n_regionkey and
-> s_nationkey = n_nationkey and
-> o_orderkey = 4161537
-> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| c_name | p_name | ps_availqty | s_name | o_custkey | r_name | n_name | sum(l_quantity) |
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell | 8343 | Supplier#000000597 | 122437 | AMERICA | UNITED STATES | 47.00 |
.
.
28 rows in set (22.98 sec)

With this join, we see pretty good performance on the smaller database, and only about a 4x increase in response time for the same 8-way operation against the 20x larger lineitem table, which again is not bad for our little 32-bit system.

One interesting item of note is that InfiniDB supports hash joins, which, at the time of this writing, MySQL does not. Moreover, InfiniDB's hash joins can be distributed and processed in parallel when a multi-node MPP configuration is used.
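
For contrast, a stock MySQL server of this era resolves the same kind of equi-join with nested-loop joins, which a plain EXPLAIN makes visible. The statement below is just a sketch of how you would check; the actual plan depends on the indexes available:

EXPLAIN
select c_name, sum(l_quantity)
from customer, orders, lineitem
where c_custkey = o_custkey and
      l_orderkey = o_orderkey and
      l_shipdate = '1992-01-02'
group by c_name;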

Let's now test InfiniDB's Extent Map to see whether queries that reference a value outside a column's minimum and maximum values can be satisfied from the Extent Map alone, without reading any actual data blocks:

mysql> select min(l_orderkey), max(l_orderkey) from lineitem;
+-----------------+-----------------+
| min(l_orderkey) | max(l_orderkey) |
+-----------------+-----------------+
|               1 |         6000000 |
+-----------------+-----------------+
1 row in set (1.25 sec)

mysql> select count(*) from lineitem where l_orderkey < 1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from lineitem where l_orderkey > 6000000;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)

The Extent Map seems to be doing its job here, which is good. There are few things more frustrating in my book than waiting a long time for a query that ends up returning zero rows.
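
The same minimum/maximum bookkeeping also helps with range predicates, not just the out-of-range probes above. As a hypothetical example (not run as part of this test), a query like the one below only needs to touch extents whose recorded l_shipdate range overlaps the requested year; extents falling entirely outside it can be skipped without reading their data blocks, with the actual benefit depending on how the dates cluster within extents:

select count(*) from lineitem
where l_shipdate between '1998-01-01' and '1998-12-31';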

I could run through other example queries, but the above should give you a decent first impression of how InfiniDB will handle data loads and queries.

When To and Not To Use InfiniDB

So what are some rules of thumb for when you should and shouldn't use InfiniDB? In general, InfiniDB may help you when you have:

  • Read-only databases, or read-only parts of an application, where fast response times are important and only subsets of the columns in a row (not entire rows) are being requested
  • Data marts, warehouses, and analytic applications where query patterns change and cannot be predicted; in other words, where it is difficult to know on an ongoing basis which columns will be requested and would need indexing
  • Very large data volumes that are not served well by a general-purpose database's optimizer
  • Scale-out scenarios where you want to do parallel processing of queries across multiple machines (note: today, this is only possible with the Enterprise Edition of InfiniDB)

Places where you don't want to use InfiniDB include:

  • Transactional applications where you want InfiniDB to serve as the transactional back-end database. Even though InfiniDB is transactional, its column-oriented nature means it will not do singleton/selective inserts and deletes as fast as a general-purpose RDBMS will
  • Query systems where all the columns in a table are requested via a primary key lookup
  • Databases that are very small and are static/not growing





