IBM Technical Exploration Center
February 1, 2017
9:00 AM - 5:00 PM
From the promotional materials:
IBM dashDB Local MPP Proof of Technology
IBM dashDB Local is a next-generation data warehousing and analytics technology for use in private clouds, virtual private clouds and other container-supported infrastructures. It is ideal when you must maintain control over data and applications, yet want cloud-like simplicity.
dashDB Local is part of a family that shares a common SQL engine and other database technologies across different deployments so you can place the right workload on the right platform—and move between them without application change.
It includes in-memory processing for delivering insanely fast answers to queries, as well as MPP for scale out and scale up capabilities as data grows. It also provides scale-in capabilities to free resources after peak workloads complete.
Agenda:
- Setup and Introduction - Introduction to dashDB Local
- GPFS FPO - Create GPFS FPO feature for clustered file system
- dashDB MPP Installation - Installation of Docker containers for dashDB in MPP deployment
- dashDB and Docker Administration - dashDB and Docker administration for day-to-day work
- dashDB Web Console - Use of the dashDB Web Console for data load, table administration, and running SQL
- Query Monitoring - Comprehensive monitoring of Analytics workload
My key takeaways:
- Vikram Khatri was the presenter. He works for IBM out of North Carolina.
- This workshop was the second stop in a 4 to 5 stop roadshow which started in New York.
- For the workshop, he wrote a detailed (roughly 100-page) tutorial containing step-by-step instructions as well as dozens of SQL and Docker scripts, which he plans to distribute (the Docker portion is valuable for unrelated work as well).
- In addition to dashDB, we spent considerable time discussing Docker, since dashDB Local MPP is a Docker-based database.
- He was able to answer a bunch of my questions, which was helpful, since I have run across some conflicting information on the web.
- Long story short, dashDB is *not* a hybrid database as in HTAP (hybrid transactional / analytical processing).
- An upfront decision needs to be made as to whether transactional or analytical processing is needed for a database.
- The two separate products are dashDB Local for transactions (row-based) and dashDB Local for analytics (column-based), but the former has not yet been released.
- There is also another product called dashDB for cloud. Unlike the other two, it runs on bare metal rather than in Docker, although the plan is to move it to Docker as well.
- He shares my opinion that Hadoop is on the decline and will be largely replaced by HTAP-style database products in the coming years.
- He noted that this opinion goes against IBM marketing for BigInsights, which is a Hadoop-based product.
- While walking through the history of database products, he commented that companies have wasted millions of dollars on Hadoop, only to go back to relational databases.
My personal notes:
- the workshop description can be found at this link
- presenter was Vikram Khatri from IBM (North Carolina)
- event organizer was Rob Beal from IBM (Chicago)
- apparently about 25 to 30 people registered to attend this event, but fewer than 10 showed up
- the first 3 of 6 lab exercises (lab01, lab02, lab03) had already been performed on the clusters beforehand to save time on downloading, installing, etc.
- at the outset, Vikram commented that individuals with DB2 experience should see some familiarity with what is presented during the workshop
- labs require only basic skills in Linux and databases
- the clustered file system on each workshop machine is visible to all 4 machines/nodes
- most labs will be performed from node dash01, but dash02, dash03, and dash04 can also be used
- most labs will be performed via the command line using GNOME Terminal as the root user...just type 'root'
- at command prompt, # is root and $ is user
- dash01 is the head node...the others are data nodes
- the focus today is dashDB, but IBM has a complete portfolio including Cloudant and BigInsights for Apache Hadoop
- Cloudant is a document store comparable to MongoDB
- "relational databases are not going away...they're going to stay"
- Vikram mentioned prior solutions which tried to fit data into relational databases using BLOBs and CLOBs, but some gaps are best addressed by non-relational databases
- he doesn't see a big market for Apache Hadoop or BigInsights...these are on the decline now
- he thinks that HTAP etc will replace Hadoop
- the industry is moving away from gathering different components and fitting them together, toward ready-made, self-contained platforms like dashDB
- we grew up in the structured data world
- we started to build on top of these tables for analytical purposes, which led to star schemas, snowflake schemas etc
- and we were doing this on row organized tables, retrofitting for about 15 years
- but this didn't look right...why read the whole system of record?...so column organized tables were devised
- then these databases started to be used for OLTP, which was a complete mistake
- dashDB for Analytics provides ACID compliance guarantees
- companies have wasted millions of dollars on Hadoop, only to go back to relational databases
- "sometimes we make stupid decisions not thinking things through"
- dashDB for Analytics and dashDB for Transactions are apparently two separate products
- I confirmed with Vikram that dashDB should not be considered an HTAP product, since an upfront decision needs to be made between transactional and analytical purposes
- it was interesting that attendees asked what I meant by "hybrid database"
- this whole movement started with NoSQL, followed by not-only-SQL
- Vikram reiterated that he doesn't see the future in BigInsights / Hadoop...he commented that this is his opinion, not the opinion of IBM
- while these technologies are still relevant, they were applied for the wrong purposes
- some questions from the audience resulted in Vikram saying that we're not trying to force MongoDB document data into DB2
- he reiterated that we've been able to store document data etc in relational databases for years using BLOBs and CLOBs
- but he followed this up by saying that it's going to be the document store which is the best for storing this type of data
- "it took IBM 25 years to figure out how to make money from something that is free"
- IBM made a mistake by pursuing token ring rather than TCP/IP a long time ago...and look where Cisco is now
- I asked him whether the IBM "open data platform" is synonymous with open source, and elaborated by saying that IBM is packaging open source products together
- another attendee clarified by saying that the open data platform is an industry standard to which IBM and Hortonworks et al conform, but not Cloudera et al
- "the new reality here is software as a service"
- someone else asked about Bluemix Data Connect...this is for data integration...Watson is available via this product
- noticing some confusion, Vikram commented that "we have been changing names rapidly"
- "Watson is not a product...it's a platform providing many different things"
- "inside dashDB, it's all DB2"
- if you know DB2, you can use it like DB2; if you don't, it will do things for you, letting you focus on the data rather than administration
- most of the administration is taken away, but there is still some administration to do
- Vikram pointed to me a few times after I mentioned HTAP databases...he pointed to me again and said we will see more and more of these capabilities in the future
- "it remains to be seen which direction the industry goes"
- a new product will be called "dashDB Cloud"
- dashDB Local is a Docker container based technology...dashDB Cloud is deployed to a bare metal server
- dashDB for Transactions is row-based DB2...dashDB for Analytics is column-based DB2
- as of right now, there is a 1:1 relationship between Docker images and machines/nodes...multiple form a cluster (clarified 7/5/2017: see comment)
- only available right now on Red Hat Enterprise Linux 7.2...this workshop uses CentOS 7.2, which is compatible
- data is encrypted at rest...someone asked about DB2 11, so I asked whether he meant that data is encrypted by default with dashDB, and he said correct...with DB2 it needs to be turned on
- Apple has been using Time Machine, a "union file system", for a long time...Docker is built on a union file system as well
- Docker is a container-based technology
- the difference with Docker is that this technology was made "very simple"
- a container is an instantiation of a Docker image
- Vikram went into some basics about Docker, and elaborated a bit more when someone joked that they're new to Docker but realize this isn't a Docker class
- the best example he gave for Docker image layers is how an image can use a different version of GCC (or any other file) than what the host operating system beneath it provides
- Vikram concurred with me that this is what provides consistency across machines
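- a quick way to see that point from the shell (a sketch assuming the public gcc image on Docker Hub; any image with a pinned toolchain would do):

      gcc --version                           # whatever the host happens to have installed
      docker run --rm gcc:4.9 gcc --version   # the image carries its own gcc, identical on every host

- because the toolchain lives in the image layers rather than on the host, every machine that runs the image sees identical behavior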
- dashDB Local requires a clustered file system
- IBM GPFS, GFS2, or VxFS can be used...these are all POSIX-compliant clustered file systems
- someone commented that GPFS is expensive, to which Vikram replied "GPFS is not part of the dashDB license...GPFS provides a lot of good things...forget about the price right now"
- GPFS can also be used for traditional DB2
- "you will be amazed at what GPFS can do"
- this workshop uses IBM GPFS with FPO...file placement optimizer
- with GPFS you don't need a SAN
- GPFS provides multiple copies of data across machines
- GPFS provides significantly better performance than HDFS...it's not new; it's been around for 25 years
- dashDB Local for Transactions is not available yet...just dashDB Local for Analytics
- while going over the steps to get access to a Docker image, the speaker commented that it can take up to 24 hours to get a Docker ID via https://cloud.docker.com because a human is involved on the other end
- to get going, you just need to pull the Docker image to each node, create a "nodes" file on each that lists the IP addresses of all nodes, and then start up dashDB on the head node (sketched below)
- database names are chosen for you
- when you see "You're almost there" in the tailed log, you can hit Control-C to exit the tail and start the dashDB MPP process on the head node
- after you see "Congratulations! You have successfully deployed dashDB" in the log, you are ready
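- a rough sketch of that deployment flow, from memory of IBM's published instructions at the time (the image name/tag, nodes-file format, and mount points are as I recall them, so verify against IBM's docs):

      # on every node: pull the dashDB Local image
      docker pull ibmdashdb/local:latest-linux

      # on every node: create a "nodes" file on the clustered file system,
      # e.g. /mnt/clusterfs/nodes, listing each node (hypothetical IPs):
      #   head_node=10.0.0.1
      #   data_node1=10.0.0.2
      #   data_node2=10.0.0.3
      #   data_node3=10.0.0.4

      # on each node: create the container, mounting the clustered file system
      docker run -d -it --privileged=true --net=host --name=dashDB \
          -v /mnt/clusterfs:/mnt/bludata0 -v /mnt/clusterfs:/mnt/blumeta0 \
          ibmdashdb/local:latest-linux

      # tail the setup log; Control-C at "You're almost there", then start
      # dashDB on the head node and wait for the "Congratulations!" message
      docker logs --follow dashDB
      docker exec -it dashDB start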
- the workshop slides mention version 1.4.1, but Vikram mentioned that he had to work late the prior night rolling version 1.5.0 out to all of the machines in the lab due to the recent release
- when I asked how to upgrade, he said he would cover the topic later, but commented that it is "much easier" than for DB2
- the version of the software stack is located at /mnt/clusterfs/SystemConfig/`hostname -s`/.build_version
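- i.e., with the backtick command substitution written in $() form (e.g. on dash01):

      cat /mnt/clusterfs/SystemConfig/$(hostname -s)/.build_version   # reads .../SystemConfig/dash01/.build_version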
- as we broke for lunch we started lab04 - dashDB and Docker Administration
- during lab04, I asked Vikram whether all 4 nodes were running on my physical machine, and he said yes...the machine had 32GB RAM and each node was allocated 7GB RAM
- at one point I needed to restart the head node because I had run some commands to add a node that were not supposed to be run
- in order to get dashDB running again, I needed to issue the following commands: 'docker exec dashDB stop' followed by 'docker exec dashDB start'
- working through lab05 - dashDB Web Console was next
- Vikram told me that he is based in North Carolina...this is the second time he has presented his materials...a couple more stops on his roadshow...New York was first
- you have complete control over the Docker images...a container is an instance of an image
- you have the key to the front door of the Docker container...the following provides an interactive shell within the container: 'docker exec -it dashDB bash'
- in a cloud environment you cannot do this
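- a quick use of that front-door shell, tying back to the "inside dashDB, it's all DB2" comment (a generic sketch; exact process names will vary):

      docker exec -it dashDB bash   # interactive shell inside the container
      ps -ef | grep -i db2          # the DB2 engine processes backing dashDB
      exit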
- remember, Docker became popular because of the union file system
- Docker is much more efficient than virtual machines, which are far more expensive in terms of resources
- virtual machines are not going to go away, just like COBOL is not going to go away, but growth in terms of adoption will disappear
- right now, IBM is releasing on a monthly basis
- Vikram reminded IBM database administrators in the room that learning Docker and dashDB are some good steps to reinvent themselves
- he commented that if our organizations do not have a Docker repository yet, they are behind...and "laggards will suffer"
- "dashDB exceeded my expectations"
- "financial customers typically want to be cutting edge"
- "retailers are the last...they don't want to do anything unless they need to"
- dashDB Local was released about 6 months ago
- I asked Vikram twice whether he could give us a sense of how many customers exist...and then the ratio between DB2 and dashDB customers...but he didn't share these figures
- "I was late into the game adopting Docker myself...that was my ignorance"
- to check the processes running inside the container: 'docker top dashDB'
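- 'docker top' gives a process list; for streaming CPU/memory figures, 'docker stats' is the companion command (both are standard Docker CLI):

      docker top dashDB     # processes running inside the container
      docker stats dashDB   # live CPU / memory / I/O usage for the container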
- NGINX is a small-footprint web server that is often used to route traffic to Docker containers
- the big difference between virtualization and Docker is that Docker uses the kernel from the host
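- an easy way to see that (assuming any small image such as alpine has been pulled):

      uname -r                          # kernel version on the host
      docker run --rm alpine uname -r   # identical: the container shares the host kernel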
- dashDB Cloud is also moving to containers...right now it is bare metal...if you want to add nodes, for example, IBM staff currently does so for you
- the db_migrate command can be used to transfer data from Netezza
- minor upgrades are easy...in response to my questions about upgrade risk, Vikram said that by the time major upgrades become available, IBM will have a way to perform them
- things you typically do via a SAN administrator can be done yourself via GPFS
- while working through lab05, I ran into an issue performing step 5.10..."unable to find user root: no matching entries in passwd file"...Vikram had already opened a ticket with Docker...the temporary fix is to run the command from another node
- I finished lab04 and lab05, but didn't complete lab06 ("dashDB query monitoring")