Friday, August 11, 2017

Fast Data Pipeline Design: Updating Per-Event Decisions by Swapping Tables

VoltDB was one of the first companies to enable a modern breed of applications: applications that tightly combine streaming, or “fast data”, with big data. We call these applications Fast Data Pipelines.
First, a quick high-level summary of the fast data pipeline architecture:
Fast Data Pipeline

The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold petabytes of data and generate Machine Learning (ML) models.
There are a number of critical requirements to get the most out of a fast data pipeline. These include the ability to:
  • Ingest and interact with the data feed in real time.
  • Make decisions on each event in the feed in real time.
  • Provide visibility into fast-moving data with real-time analytics.
  • Seamlessly integrate with the systems designed to store Big Data.
  • Deliver analytic results (mined “knowledge”) from the Big Data systems quickly to the decision engine, closing the data loop. This mined knowledge can then be used to inform per-event decisions.
Hundreds of Fast Data Pipeline applications have been built and deployed using VoltDB as the fast operational database (the glue) between Fast and Big. These applications provide real-time decisioning engines in financial fraud detection, digital ad tech optimization, electric smart grid, mobile gaming, and IoT, among other industries.
This post drills into how to implement a specific portion of the fast data pipeline, namely the last bullet: closing the data loop by taking knowledge from a Big Data system and applying it, online, to the real-time decision engine (VoltDB).

Closing the Data Loop

“Per-event decisioning” means that an action is computed for each incoming event (each transaction). Usually some set of facts, often computed from historical data, informs the decision. These “facts” could be captured in machine learning models or consist of a set of generated rules to be executed on each incoming event. Or the facts could be represented as rows in a database table, used to filter and generate optimized decisions for each event. This post focuses on the latter: storing and updating facts represented in database tables.

When storing facts in database tables, each row corresponds to some bit of intelligence for a particular value or set of values. For example, the facts might be a pricing table for airline flights, where each row corresponds to a route and service level. Or the values might be a list of demographic segmentation buckets (median income, marital status, etc.) for browser cookies or device ids, used to serve demographic-specific ads.
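
To make this concrete, here is a minimal, hypothetical sketch of what the airline pricing fact table might look like as VoltDB DDL (the table and column names are illustrative only, not part of any VoltDB sample):

-- Hypothetical fact table: one row of pricing "intelligence"
-- per route and service level
CREATE TABLE flight_prices
(
  route          varchar(16) NOT NULL  -- e.g. 'BOS-SFO'
, service_level  varchar(16) NOT NULL  -- e.g. 'ECONOMY', 'BUSINESS'
, price_usd      decimal     NOT NULL
, CONSTRAINT PK_flight_prices PRIMARY KEY (route, service_level)
);

-- A per-event decision is then just an indexed lookup against the facts:
--   SELECT price_usd FROM flight_prices WHERE route = ? AND service_level = ?;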

Fact tables are application-specific, can be simple or sophisticated, and are often computed from a historical “big data” store such as Spark, Hadoop, or a commercial data warehouse. Fact tables are often quite large and may be recomputed frequently, perhaps weekly, daily, or even hourly.

It is often important that the set of facts changes atomically. In other words, if airline prices are changing for tens of thousands of flights, all the prices should change at once, instantly. It is unacceptable for some transactions to reference older prices and some newer prices during the time it takes to load millions of rows of new data. This problem can be challenging when dealing with large fact tables, as transactionally changing millions of values can be a slow, blocking operation. Locking a table, thus blocking ongoing operations, is unacceptable when your application is processing hundreds of thousands of transactions per second.

VoltDB solves this challenge in a simple and efficient manner: it has the ability to transactionally swap tables in a single operation. Here is how it works:

  1. Create an exact copy of your fact table schema, giving it a different name, perhaps Facts_Table and Facts_Table_2.

  2. Make sure the schemas are indeed identical (and neither is the source of a materialized view).

  3. While your application is running (and consulting rows in Facts_Table to make decisions), populate Facts_Table_2 with the new set of data that you wish future transactions to consult. This table can be populated as slowly (or as quickly) as you like, perhaps over the course of a day.

  4. When Facts_Table_2 is populated and you are ready to make it “live” in your application, call the VoltDB system procedure @SwapTables. This operation switches the data between the two tables by swapping internal memory pointers, so it executes in the sub-millisecond to single-digit millisecond range.

  5. At this point, all the data that was in Facts_Table_2 is now in Facts_Table, and the old data in Facts_Table now resides in Facts_Table_2. You may consider truncating Facts_Table_2 in preparation for your next refresh of facts (and to reduce your memory footprint).

Let’s look at a contrived example using the VoltDB Voter sample application, a simple simulation of an ‘American Idol’ voting system. Let’s assume that each day you are going to feature different contestants for which callers can vote. Voting needs to occur 24x7, and the contestants change every day at midnight. We don’t want any downtime when changing our contestant list - no maintenance window, for example.

Here’s what we need to do to the Voter sample to effect this behavior:

  1. First we create an exact copy of our CONTESTANTS table, calling it CONTESTANTS_2:

-- contestants_2 table holds the next day's contestant numbers
-- (for voting) and names
CREATE TABLE contestants_2
(
 contestant_number integer     NOT NULL
, contestant_name   varchar(50) NOT NULL
, CONSTRAINT PK_contestants_2 PRIMARY KEY
 (
   contestant_number
 )
);

2. The schemas are identical, and this table is not the source of a materialized view.

3. The Voter application pre-loads the CONTESTANTS table at the start of the benchmark with the following contestants:

1> select * from contestants;
CONTESTANT_NUMBER  CONTESTANT_NAME
------------------ ----------------
                1 Edwina Burnam   
                2 Tabatha Gehling
                3 Kelly Clauss    
                4 Jessie Alloway  
                5 Alana Bregman   
                6 Jessie Eichman  

We stage the next day’s contestants in a CSV file and load them into CONTESTANTS_2 using csvloader:

$ cat contestants_2.csv
1, Tom Brady
2, Matt Ryan
3, Aaron Rodgers
4, Drew Brees
5, Andrew Luck
6, Kirk Cousins

$ csvloader contestants_2 -f contestants_2.csv
Read 6 rows from file and successfully inserted 6 rows (final)
Elapsed time: 0.905 seconds
$ sqlcmd
SQL Command :: localhost:21212
1> select * from contestants_2;
CONTESTANT_NUMBER  CONTESTANT_NAME
------------------ ----------------
                1 Tom Brady       
                2 Matt Ryan       
                3 Aaron Rodgers   
                4 Drew Brees      
                5 Andrew Luck     
                6 Kirk Cousins    

(Returned 6 rows in 0.01s)

4. Now that we have the new contestants (fact table) loaded and staged, when we’re ready (at midnight!) we’ll swap the two tables, making the new set of contestants immediately available for voting without interrupting the application. We’ll do this by calling the @SwapTables system procedure as follows:

$ sqlcmd
SQL Command :: localhost:21212
1> exec @SwapTables contestants_2 contestants;
modified_tuples
----------------
             12

(Returned 1 rows in 0.02s)
2> select * from contestants;
CONTESTANT_NUMBER  CONTESTANT_NAME
------------------ ----------------
                6 Kirk Cousins    
                5 Andrew Luck     
                4 Drew Brees      
                3 Aaron Rodgers   
                2 Matt Ryan       
                1 Tom Brady       

(Returned 6 rows in 0.01s)


5. Finally, we’ll truncate the CONTESTANTS_2 table, leaving it once again ready to be loaded with the next day’s contestants:

$ sqlcmd
SQL Command :: localhost:21212
1> truncate table contestants_2;
(Returned 6 rows in 0.03s)
2> select * from contestants_2;
CONTESTANT_NUMBER  CONTESTANT_NAME
------------------ ----------------

(Returned 0 rows in 0.00s)

Note that steps 3-5 (loading, swapping, and truncating the fact table) can all be done in an automated fashion, rather than manually as demonstrated in this simple example.
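
For example, a nightly refresh could be scripted with the same csvloader and sqlcmd tools used above. The following is only a sketch: the script name, the cron-style scheduling, and the assumption that the cluster is reachable at the default localhost:21212 are all hypothetical.

$ cat refresh_contestants.sh
#!/usr/bin/env bash
# Hypothetical nightly refresh, e.g. run from cron shortly before midnight.
# Assumes contestants_2.csv holds the next day's contestants and the
# VoltDB cluster is reachable at the default localhost:21212.
set -e

# Step 3: stage the new facts into the offline table
csvloader contestants_2 -f contestants_2.csv

# Step 4: atomically make the new facts live
echo "exec @SwapTables contestants_2 contestants;" | sqlcmd

# Step 5: clear the staging table for the next refresh
echo "truncate table contestants_2;" | sqlcmd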

Running the Voter sample and arbitrarily invoking @SwapTables during the middle of the run yielded the following results:

A total of 15,294,976 votes were received during the benchmark...
- 15,142,056 Accepted
-   152,857 Rejected (Invalid Contestant)
-        63 Rejected (Maximum Vote Count Reached)
-         0 Failed (Transaction Error)

Contestant Name    Votes Received
Tom Brady               4,472,147
Kirk Cousins            3,036,647
Andrew Luck             2,193,442
Matt Ryan               1,986,615
Drew Brees              1,963,903
Aaron Rodgers           1,937,391

The Winner is: Tom Brady

Apologies to those not New England-based! As you might have guessed, VoltDB’s headquarters are just outside of Boston, Massachusetts.

Just the Facts, Ma’am

Leveraging big data intelligence to make per-event decisions is an important component of a real-time decision engine within your data pipeline. When you build fast data pipeline applications, VoltDB provides tools and functionality to make this process easy and painless to a running application. Two key tasks need to be performed: loading your new fact table into VoltDB, and atomically making that new data “live” to your business logic.
Loading data into VoltDB from an external data source can be done easily via a couple of approaches: you can use one of our loaders such as the CSV, Kafka or JDBC loader; or you can write an application to insert the data.
Swapping tables in VoltDB is a trivial exercise with the @SwapTables system procedure. And most importantly, swapping in new fact table data does not impact ongoing stream processing.

Wednesday, August 9, 2017

Top 5 Ways to Better Use Your Data


This post originally appeared on VoltDB.com in May, 2016. 

Sir Francis Bacon is said to have coined the phrase “scientia potentia est”, translated as ‘knowledge is power’. Four hundred years later, we might rewrite the phrase as “data potentia est”. Data is power - so how can you better use your organization’s data? Here are five suggestions to help you drive value from it - quickly.

1)    Know what you have.

First and foremost, inventory your data. 

There are two types of data to identify. First is historical data. Historical data is data you’ve accumulated over years of doing business. This could include databases, files, spreadsheets, presentations, transactions, logs, etc. The second type of data is the data that is being created “right now” - this is real-time data. Real-time data potentially has immediate value, and then ultimately turns into historical data.

Catalog and prioritize the data you have. Ideally, you will also want to identify the sources of data. Knowing how data is created allows you to capture it, store it, and eventually extract value from it, at the least cost to the organization, and with the best ROI.

The value of each type of data is different. Historical data allows you to analyze and mine past events. Real-time data gives you the opportunity to calculate analytics, possibly compare them to historical trends, and perform business actions in real-time, to capture additional and immediate value.

By way of example, consider a fraud prevention offering. Fraudulent transaction patterns are mined from historical data. These historical patterns are applied to real-time transactions to identify and reject suspected fraudulent transactions.

2)    Architect a data strategy that handles both Big and Fast data.

Creating a historical archive, perhaps a data lake via a Hadoop cluster, to store your data is only one step. Today enterprises create data at a tremendous - and growing - rate.  Processing and ingesting data in batch mode overnight is no longer acceptable. Real-time responsive enterprises need to process and react to data in seconds to minutes. Many organizations, including mobile operators, telecom providers, financial services organizations and advertising technology providers must respond in milliseconds.

3)    Choose the appropriate technologies.

There is a plethora of big data tools, most designed around best practices and optimized to extract value from both historical and real-time data.

Minimally you will need technologies for these areas:

Big Data: Typically, the main data management platform for big data is Hadoop, a data warehouse, or a combination of the two to handle both structured and unstructured data. Together they act as the repository for all your data, often called the “data lake”. The data lake stores historical data to be analyzed and mined.

Fast Data: Data is being created at a dizzying rate every day. Fast data is data that is being created now and is streaming into your company now. It could be user clicks on your corporate web page, product downloads, or any operational event occurring in your organization. To deliver this fast data to the systems that can act on it, consider a message queuing system such as Kafka. To eliminate batch processing (slow data!), this message queue needs to deliver event data to an operational data store capable of handling and processing messages at web-scale speed: thousands to tens of thousands to even millions of events per second. The operational data store’s role is to ingest the data and process it in real time. Real-time processing can include computing real-time analytics such as counts, aggregations and leaderboards, issuing real-time alerts, deduping, enriching and aggregating events, and making transactional decisions on an event-by-event basis. Both NewSQL and NoSQL operational stores can provide horsepower for handling real-time processing of event streams. Modern operational data stores range from strongly consistent SQL databases to eventually consistent key/value and document stores. Consider numerous factors when choosing, including transactions as well as the query interface, which is important for your data visualization tooling.
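
As a small illustration of what such real-time processing can look like in a SQL-based operational store, here is a minimal sketch of a continuously maintained aggregation (a simple leaderboard). The syntax is VoltDB-flavored, and the table and view names are hypothetical:

-- Raw events stream into this table at a high rate
CREATE TABLE page_clicks
(
  user_id    bigint       NOT NULL
, page_url   varchar(256) NOT NULL
, click_time timestamp    NOT NULL
);

-- Continuous query: per-page click counts, updated transactionally
-- as each event is ingested, ready to serve a real-time dashboard
CREATE VIEW clicks_by_page (page_url, total_clicks)
AS SELECT page_url, COUNT(*)
   FROM page_clicks
   GROUP BY page_url;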

Data Visualization: Dashboards, charts, leaderboards, pivot tables, and visualizations all play a key role in understanding your data, both historical and real-time.

Historical visualization helps you explore, understand patterns, and create predictive analytics. Real-time visualizations help you understand the current state of your business, usually in the form of a real-time dashboard.

You will want to evaluate tools from vendors such as Tableau, Qlik and MicroStrategy for dashboarding and ad hoc visualizations -- user experience is a critical factor with this kind of software so having your users try it out is essential.

Data Science: A growing number of tools can help you extract information and insight from your data. Machine learning packages provide data classification, clustering, and regression analysis, and allow software to “learn” to identify and make predictions on data. Consider popular open source offerings such as Spark (MLlib) or R to get started.

4)    Build a Data Pipeline that delivers Data as a Service (DaaS) to internal customers.

Define an architecture that serves data to your internal customers. Capturing and analyzing the data is great, but it is only the first step. Data and insights must be readily available to consumers (people and applications) across your enterprise. Consumers of your data must be able to tap into both historical data from the data lake as well as real-time fast data, along with the insights derived from both together.

5)    Begin building applications to extract value from the data  -  then iterate.

Start small and add incrementally. Identify opportunities for small quick wins that will prove you can capture value from your data. Realize that data evolves and new patterns will emerge. Foster an environment of experimentation, innovation and continuous improvement and iterate on your data analysis.

Data is valuable. Batch processing is so 1990s. Now you’ve got five ideas for how to extract more value from your data. Start now and iterate. Think Big, of course, but also Think Fast.

Tuesday, August 8, 2017

Streaming Data: Billions and Billions (of transactions)

 This post originally appeared on VoltDB.com in October, 2016.

Billions and billions[1] of transactions are executed each and every day by production-deployed VoltDB applications.
A back-of-the-napkin calculation puts the number north of 50 billion a day; the real number is likely much, much higher. These billions of transactions are executed across a variety of innovative applications, applications that were not practical, or sometimes even possible, to build a decade ago. Chances are that you have touched VoltDB – without knowing it – today!
  • When you call someone on your cell, there’s a good chance a VoltDB application is approving the call.
  • If you have played some of the top 10 mobile games within the past few years, a VoltDB application likely processed your in-game events.
  • If you’ve run a road race recently, there’s a good chance a VoltDB application tracked your progress.
  • If you live in the UK and turn on a light switch, you are interacting with VoltDB as your electrical smart grid uses VoltDB to authorize and process grid and meter events.
  • If you play baccarat in a casino, there’s a good chance the bets placed and movement of chips on each gaming table is validated by VoltDB via in-chip RFID technology.
  • If you bought something in China with your credit card, it is likely that your transaction was verified as valid or detected as fraudulent by a VoltDB application.
The VoltDB engineering team has been building the world’s fastest in-memory database for over eight years now. It’s been a fun as well as challenging journey and we’ve learned a lot in the process. Some of the more notable learnings have been:
Batch is so 2000s. Legacy applications are moving from batch processing of feeds to real-time processing. New applications are being built to ingest, transact, and compute analytics on data streaming from all types of sources, such as IoT sensors, mobile devices, logs, etc. The world is moving to real-time and turning to products like VoltDB, which has the capability to handle high throughput (thousands to millions of events per second).
SQL [still] rules the roost.  VoltDB was founded around the same time as the NoSQL movement. At the time, building a distributed SQL data store seemed contrarian, given the NoSQL hype. Fast forward a half-dozen years and we now see many of the more popular NoSQL products offering SQL or SQL-like interfaces. Even the Hadoop ecosystem has added SQL interfaces to the data lake. SQL is here to stay (and, in fact, never left).
VoltDB’s commitment to SQL remains strong. Over the past year we’ve added windowing analytic SQL, geospatial SQL support (points and polygons), continuous queries via real-time materialized views, the ability to join streams of data to compute real-time analytics, and the ability to very quickly generate an approximate count of the number of distinct values. More advanced SQL capabilities are made available nearly every month.
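To give a flavor of two of these capabilities, here is a minimal sketch of an approximate distinct count and a windowed ranking query. The events and device_counts tables and their columns are hypothetical; the function names reflect the VoltDB SQL features described above.

-- Approximate distinct count: roughly how many unique devices have we seen?
SELECT APPROX_COUNT_DISTINCT(device_id) FROM events;

-- Windowed analytic SQL: rank devices by event volume within each region
SELECT region, device_id, event_count,
       RANK() OVER (PARTITION BY region ORDER BY event_count DESC) AS region_rank
FROM device_counts;
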
Expect the unexpected, operationally. When you build a highly-available, clustered database that runs 24×7 at high throughput, on bare metal or in the cloud, all types of error scenarios that could happen, do happen. Disks fail, networks break down, power goes out, mistakes are made, lightning strikes.
Building a production-ready operational database is tough work. The bar is set high and not negotiable. The following features are table stakes for current and next generation applications:
  • High availability: Machines, containers and virtual machines can and do break, crash or die. Networks pause or partition. The database must handle these situations properly.
  • Correctness: Accuracy matters. Every client should see a consistent view of the data.
  • Predictable latency: Response time should be predictable 99.999% of the time.
  • Durability: Never lose data even in the face of nodes crashing or networks partitioning.
  • Performance: The database needs to scale with your application. Adding nodes should increase throughput smoothly and linearly.
To maintain these requirements across all scenarios, VoltDB Engineering constantly pounds on the database in our QA lab. We focus on randomizing errors of all types. The more we learn, the more testing we realize we need. That’s one of the reasons we reached out to Kyle Kingsbury (aka Aphyr) to independently apply his dastardly Jepsen test suite and methodology to VoltDB. As part of this effort, we found a few issues in the product before they caused problems in customer deployments. This effort remains one of the team’s proudest accomplishments of 2016.
But we know there’s always more work to do, and we’re committed to doing it.
Streaming analytics and operational data stores are merging into a new real-time platform.
The architecture required for real-time processing of streaming data is evolving rapidly. When we started building VoltDB, a developer had to cobble together numerous pieces of technology (e.g., ZooKeeper, Kafka, Storm, Cassandra) to process real-time streams of data.
VoltDB Founding Engineer John Hugg stated at a recent Facebook @Scale conference, “OLTP at scale looks a lot like a streaming application.”  Today we’re seeing the result of that evolution: a convergence of real-time ingestion, streaming analytics, and operational interaction into a single platform. While creating this next generation of database, the VoltDB engineering team has assembled the core components for building streaming fast data applications in VoltDB:
  • Real-time ingestion with in-process, highly-available importers reading from Kafka, Kinesis and other streaming sources;
  • Streaming analytics on live data via ad hoc SQL analytics as well as continuous queries; and
  • A blazingly fast in-memory storage and ACID transaction execution engine capable of accessing hot and historical data to make per-event decisions millions of times per second (a minimal DDL sketch of these pieces follows this list).
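Here is a minimal DDL sketch of how the second and third of these pieces fit together: continuous queries expressed as materialized views, and per-event transactions expressed as partitioned single-statement procedures. The table, view, and procedure names are hypothetical.

-- Hypothetical event table, partitioned for parallel, high-throughput ingest
CREATE TABLE events
(
  device_id  bigint     NOT NULL
, region     varchar(8)
, event_time timestamp  NOT NULL
);
PARTITION TABLE events ON COLUMN device_id;

-- Continuous query: per-region event counts, maintained as data streams in
CREATE VIEW events_by_region (region, event_count)
AS SELECT region, COUNT(*) FROM events GROUP BY region;

-- Per-event decision point: a single-statement stored procedure,
-- routed to the correct partition by device_id
CREATE PROCEDURE record_event
  PARTITION ON TABLE events COLUMN device_id
  AS INSERT INTO events (device_id, region, event_time) VALUES (?, ?, ?);
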
The VoltDB engineering team is not done building (or learning)! Our mission is to make building highly reliable, high-throughput fast data applications not only possible, as it is today, but also extremely easy. So stay tuned. Over the coming months we’ll be delivering additional geo-distributed and high availability features, additional importers and exporters for building fast data pipeline applications, and improved real-time analytical SQL capabilities.
[1] With a hat tip to Johnny Carson and Carl Sagan