Thursday, December 8, 2016

Big Data Technology Impacts of NoSQL, Distributed Data Stores...

The last 30 years have brought dramatic change to the world and to technology.  Moore's law and the internet's relentless march now connect more devices than there are people on the planet; Cisco estimates 50 billion connected devices by 2020.  To many, cloud computing, big data, and NoSQL tools sit hidden behind Facebook, Google, and ubiquitous smartphone apps. Even those in IT may sometimes struggle to synthesize these innovations.

In this post we'll take a brief look at the implications of some of these changes on traditional business intelligence and analytic capabilities. Think of this as a quick survey of the landscape and not a deep dive into its many facets. Let's start with a seemingly esoteric NoSQL capability: graph processing and databases.

Graph Processing and Databases

We benefit from graph technology every day. Let's look at a visualization and some use cases.


Real-time recommendations on retail web sites and the sometimes creepy "friend recommendations" on Facebook. Both echo chamber and gold mine: social network analytics for marketing and advertising. GPS applications like Waze or Google Maps. A little less obvious are use cases like network and operations root-cause analysis, or contextualizing real-time streams of event and other data from the Internet of Things' sensors. Finally, saving money and sometimes lives via fraud detection, cybersecurity, and medical research.

When the relationships between things are as important as the things themselves, graphs are attractive. Beyond the ubiquity of this hidden technology, notice that unlike traditional BI and analytics, graph processing can operate in real time for some key use cases, e.g. recommendations and identity and access management.

Before we move on to look at NoSQL more generally, it's important to recognize a few things about graph processing and graph databases. Graph databases, e.g. Neo4j, often compete as OLTP tools. Large-scale graph processing, e.g. Giraph or Pregel, is more of an OLAP capability, so the graph's data may sit in another NoSQL distributed data store. Having said that, the intuitive nature of the graph database also lends itself to OLAP.

Depending on the scale and type of the analysis, graph queries written in languages like Cypher or Gremlin are often dramatically simpler than similar queries in SQL. Queries across many layers of related things can also perform orders of magnitude faster in a graph database than in a relational database.
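To make that concrete, here is a minimal sketch of a friend-of-a-friend recommendation query in Cypher, run through the Neo4j Python driver. The Person label, FRIEND relationship, URL, and credentials are all hypothetical; the equivalent SQL would need two self-joins on a friendship table plus a subquery to exclude existing friends.

```python
# A minimal sketch, assuming a hypothetical Neo4j instance holding
# (:Person)-[:FRIEND]->(:Person) data. One Cypher pattern replaces
# what would be a multi-join SQL statement.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # hypothetical credentials

FOAF_QUERY = """
MATCH (me:Person {name: $name})-[:FRIEND]->(:Person)-[:FRIEND]->(foaf:Person)
WHERE NOT (me)-[:FRIEND]->(foaf) AND me <> foaf
RETURN foaf.name AS suggestion, count(*) AS mutual_friends
ORDER BY mutual_friends DESC
"""

with driver.session() as session:
    for record in session.run(FOAF_QUERY, name="Alice"):
        print(record["suggestion"], record["mutual_friends"])
```

Each additional hop is just one more pattern element in the MATCH clause, whereas SQL would need another self-join; that's the source of both the readability and the performance gap on deep traversals.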

Before we go any further, we'll need at least a passing understanding of how NoSQL enables big data.

NoSQL - A Quick Overview

What's happened to the "relational" world? Data warehouses, data marts, and multi-dimensional cubes for OLAP, along with SAS, once ruled the BI and data analysis landscape.  Yet open source projects like R, Hadoop, Cassandra, and Spark are now at the forefront of the "big" data world.  Even SQL databases' OLTP dominance may be challenged by modern graph databases like Neo4j, with their ACID support, ease of use, and lightning speed for some use cases.

NoSQL - could anyone have picked a worse or more misleading name? Every major cloud provider offers SQL-based APIs to many of their "NoSQL" data sources.  SQL tends to connote relational data stores and tables. Yet graph databases are intrinsically relational, and the open source descendant of Google's distributed, non-relational data store BigTable (now categorized as column-based NoSQL) is HBase, queried via MapReduce. Oh well, at least column-based NoSQL is not really tabular.

The characteristics of the NoSQL types overlap, suggesting innovation will continue. My advice is to think "not only SQL" when you think of NoSQL.

Here is Wikipedia's current taxonomy of the relevant types of NoSQL, along with example implementations:
  • Column: Cassandra, HBase
    • This type is often part of a distributed data store, which has many benefits beyond the scope of this conversation. For now we'll just say it allows reliable, fast querying against huge amounts of data across many computers, or nodes, at once
    • MapReduce often handles querying, where "map" sorts and filters and "reduce" summarizes.  Other common query tools include Hive and Pig
    • A column of a distributed data store is its lowest-level object. It is a tuple (a key-value pair) consisting of three elements: a unique name, a value (or set of values), and a timestamp used to determine whether the content is valid or stale.
    • Example columns (rendered as a Python sketch after this list):
      • street: {name: "street", value: "1234 x street", timestamp: 123456789},
      • city: {name: "city", value: "san francisco", timestamp: 123456789}
    • This data often resides on a Hadoop filesystem and may be performance optimized via Spark
  • Graph: Neo4J, Apache Giraph, MarkLogic
    • Graph databases use nodes, which represent entities like a person; edges, which represent relationships; and properties, which can be associated with both nodes and edges
    • Working with graph databases is generally intuitive, as the storage model directly reflects natural language rather than being burdened with the physical implementation details of most other data stores, e.g. relational or column
    • Relational databases use keys to relate entities to one another, leading to "joining" one or many tables. Graph databases use pointers to relate one entity to another and may capture properties about the relationship.  The deeper the levels of relationships, the more the graph database differentiates itself from relational stores.
    • Compared with relational databases, graph databases are often faster for associative (i.e. highly related) data sets and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are better suited to managing ad hoc and changing data with evolving schemas.
    • Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements.
  • Multi-model: MarkLogic, OrientDB
    • Supports multiple data models against a single, integrated backend. Document, graph, relational, and key-value models are examples of data models that may be supported by a multi-model database.
    • An article by Martin Fowler, Polyglot Persistence, suggests this type will become more prevalent over time
  • Document: MarkLogic, MongoDB
    • A document-oriented database, or document store, is designed for storing, retrieving, and managing document-oriented information
    • Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.
  • Key-Value: Redis, Oracle NoSQL
    • Manages data in associative arrays, a data structure more commonly known today as a dictionary or hash. 
    • Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.
    • Very easy for developers to use, e.g. to persist objects.  Computationally powerful for some use cases. Some graph databases' underlying implementations are key-value stores.
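To make the column and key-value models concrete, here is a minimal Python sketch; the row key, field names, and values simply extend the hypothetical address example from the column bullet above.

```python
# A minimal sketch of the two simplest NoSQL shapes described above.
# Names and values are illustrative, not tied to any particular product.

# Column model: each column is a (name, value, timestamp) tuple;
# a row groups columns under a row key.
row = {
    "row_key": "customer:42",
    "columns": {
        "street": {"name": "street", "value": "1234 x street", "timestamp": 123456789},
        "city":   {"name": "city",   "value": "san francisco", "timestamp": 123456789},
    },
}

# Key-value model: the store is one big dictionary. The value is opaque
# to the database, so lookups happen only by key.
kv_store = {}
kv_store["customer:42"] = {"street": "1234 x street", "city": "san francisco"}
print(kv_store["customer:42"]["city"])  # fast lookup by unique key
```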

Dealing with the Big in Big Data

A simple example:
Consider analyzing invoices versus sensor data from the Internet of Things. Traditionally, for invoices you might extract the invoice data from your transactional data store, transform it into dimensions and facts, and load it into your data mart to run reports, or do more sophisticated analysis by loading it into a cube. For gigabytes of data this works great (if requiring a lot of money to set up - more on that in a minute).

Sensor data sets are often much, much larger; think terabytes. Yet we can simplify and say the data still has many dimensions (e.g. time, sensor type) and sensor readings (facts) to assess.  At that scale, using ETL to create a data mart is impractical. Enter "big data" with its distributed data stores and analytic capabilities.

Let's take a more accessible, if oversimplified, example than sensor data. Perhaps Macy's wants to analyze the last 50 years of sales invoices.  Rather than put all of that data into a data mart, they might leverage the distributed computing power of Hadoop and MapReduce. They could create a text file for each invoice and then distribute the invoices across many data stores in the cloud. Next they'd leverage MapReduce to do their summarizations across many computers in parallel.

It turns out this processing can be done very quickly. Programming is involved, but at nowhere near the cost of the more traditional approach. The aggregations this processing produces can then be fed into a tool like SAS (or R) for more in-depth, ad hoc analytics.
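Here is a minimal, single-machine Python sketch of those map and reduce phases; the invoice line format and the (year, store) grouping are invented for illustration, and a real job would run each phase in parallel across many nodes via Hadoop.

```python
# A toy, in-process illustration of MapReduce: "map" parses and filters,
# the framework shuffles/sorts by key, and "reduce" summarizes.
from itertools import groupby
from operator import itemgetter

invoices = [
    "2015,store-07,129.99",  # hypothetical lines: year, store, invoice total
    "2015,store-07,19.50",
    "2016,store-12,240.00",
]

def map_phase(line):
    """Map: parse a raw invoice line and emit a (key, value) pair."""
    year, store, total = line.split(",")
    return ((year, store), float(total))

def reduce_phase(key, values):
    """Reduce: summarize all values sharing a key."""
    return key, sum(values)

# Shuffle/sort: Hadoop groups the mapped pairs by key between the phases.
mapped = sorted(map(map_phase, invoices), key=itemgetter(0))
for key, group in groupby(mapped, key=itemgetter(0)):
    print(reduce_phase(key, (value for _, value in group)))
# -> (('2015', 'store-07'), 149.49)  sales totals per year and store
```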

Getting more granular:
The term "text file" above could use some expansion. These could be realized as NoSQL columns, hive "tables" etc.  Hive is a great example of why we might choose "not only SQL" as the expansion of the term NoSQL. It exposes a SQL API to interact with virtual tables.  

As the complexity of the problem being solved grows, so do the implementation details. For example, designing the structure of an HBase database is non-trivial.

In the simple example above, we said the outputs of MapReduce could then be input to a tool like SAS for ad hoc querying.  Apache Spark goes further, enabling Hive to present users with complex, OLAP-style ad hoc query capabilities against huge distributed data stores.

Spark creates a flexible abstraction called the "resilient distributed dataset" (RDD) that aggregates data across many computers in a cluster. This aggregation overcomes MapReduce's limitation of requiring a linear data flow where you map a function across data and then reduce the results onto disk. It does so by creating shared memory across the computers in the cluster, enabling iterative passes over the data. Put more simply, it brings OLAP (and machine learning capabilities) to cloud tools including Hadoop's HDFS, Cassandra, Amazon S3, OpenStack's Swift...
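As a sketch of what that looks like in practice, here is a minimal PySpark example reusing the invoice scenario; the HDFS path and file layout are hypothetical. Note how cache() keeps the RDD in cluster memory so the second pass doesn't re-read from disk, which is exactly the iterative pattern MapReduce's map-then-write-to-disk cycle makes expensive.

```python
# A minimal PySpark sketch against a hypothetical invoices.csv
# (year,store,total) on HDFS.
from pyspark import SparkContext

sc = SparkContext(appName="invoice-summary")

invoices = (sc.textFile("hdfs:///data/invoices.csv")    # hypothetical path
              .map(lambda line: line.split(","))
              .map(lambda f: ((f[0], f[1]), float(f[2])))
              .cache())                                  # keep in cluster memory

# First pass: total sales per (year, store).
totals = invoices.reduceByKey(lambda a, b: a + b)

# Second pass over the same cached data: invoice counts, no disk re-read.
counts = invoices.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)

for key, total in totals.take(5):
    print(key, total)
sc.stop()
```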

However, Spark also necessitates additional complexity, such as the use of cluster managers. It ships its own native manager and also supports Apache Mesos and Hadoop YARN.  Once again we see Google innovation at work: Mesos is a conceptual relative of Google's Omega scheduler, which Google has used to manage its services at scale. We'll stop here, as cluster managers are a topic unto themselves.

Finishing up

This post summarizes a huge, complicated space in a few pages; I hope you found it useful. If I've misrepresented rather than simplified, please comment so I can update this post.

So we started big; let's end big.  Facebook's network of friends has made it a pioneer in graph processing. They have scaled graph processing up to handle a trillion edges, i.e. relationships! Read about that here.  Here's a visual of their conceptual architecture.



Monday, December 5, 2016

Big Data Business Intelligence and Visualization 2016

Early in my career at Accenture I spent a lot of time consulting in Enterprise Information Management, or more specifically Content Management and Publishing (IA), Data Strategy and Governance, and a bit in BI.  After spending some years in mostly Agile, OLTP, and SOA work, job opportunities have presented themselves that have me looking into the space again. This post is the first of a few on developments here. This one looks at Gartner's 2016 vendor POV and some self-service data visualization vendors.

With the advent of big data technologies like Hadoop, MapReduce, Spark, and graph databases (Neo4j, Pregel), along with data visualization, multi-dimensional networks, IoT, and cloud, this space is ripe with opportunity. Gartner's 2016 Magic Quadrant for Business Intelligence and Analytics Platforms illustrates some of this. They have fundamentally retooled how they look at the space.

Gartner's view is generally: 


  1. The market is moving from centralized IT BI capability delivery to decentralized self-service business capabilities often baked into business processes. Business self-service data preparation and analysis is a good example. 
  2. From a technology perspective: "By 2018, smart, governed, Hadoop-based, search-based and visual-based data discovery will converge in a single form of next-generation data discovery that will include self-service data preparation and natural-language generation."
  3. The vendor market is fragmented. Trying to rely on one big vendor will likely not meet all of your needs. In the near term businesses should assess their needs and find the products that match them.

Gartner's five use cases reflect this view:


  1. Agile Centralized BI Provisioning
  2. Decentralized Analytics
  3. Governed Data Discovery
  4. Embedded BI
  5. Extranet Deployment
Before we look at the Magic Quadrant, let's look at the overall vendor group's performance against Gartner's 2016 "Critical Capabilities for Business Intelligence and Analytics Platforms" publication. It is interesting that the rankings, on a 1-5 scale, are on average not that good. I am not sure if this is a common occurrence in their assessment methodology; the rankings may simply reflect market conditions.

The group's only "excellent" capability average comes in "BI Platform Administration." It looks like the cloud may be helping to level the playing field a bit for new entrants.

The overall vendor results are similarly mixed:
Gartner provides customer reviews of many of these vendors here.

Gartner Magic Quadrant


Note that the Leaders rank only "fair" on average across Gartner's 14 critical capabilities.  Of those with an average rank of "good," most are in the Visionary quadrant and 45% are in Niche.  Our deep-pocketed product vendors mostly average "fair" as well. It's an interesting market right now.

Something to consider: What do you really want? 

Gartner's assessment criteria are somewhat conservative, focusing as much on vendor stability, size, etc. as on the quality of the underlying product implementation.  To some extent that unmeasured quality characteristic dictates a vendor's ability to innovate and deliver capability differentiation. This suggests Gartner may be a poor guide if you are after differentiation. For example, graph processing and databases, as well as graph-based analytics delivered via cloud, show promise; look at what Facebook, eBay, AWS, and Google have been up to. More on this in a future post. Another example is in the data visualization space. Let's take a quick look there now.

Data Visualization Product Landscape


Data visualization is key to modern analytics. Roughly half of the top products made the Gartner list. Again, that makes sense given Gartner's assessment criteria, but it sure seems there is opportunity out there to leverage!


Additional References

Here is a link to the Google Sheet used to do the analysis above. It has a bit more data and links to more as well.


Friday, November 25, 2016

Kanban: Successful Evolutionary Change for Your Technology Business

Kanban: Successful Evolutionary Change for Your Technology Business by David J. Anderson.

If you are an Agilist, this is a must-read book. This is another case of my experiencing key concepts from a book prior to reading it: the importance of slack, focusing on cycle time, empowering teams...

I wish I had read this the day it came out. I'm working on a summary now. It starts here.

Friday, November 18, 2016

DevOps Business Cases - Chatting with Robert...

I had a chance to catch up with my friend Robert Boyd. We had a nice chat about DevOps business cases. Prior to talking, I threw together a Google Slides deck so we'd have something to react to. It turns out we were seeing things very much the same way. We made a couple of tweaks to the ideas as we talked; namely, we added Robert's favorite three top-down metrics at the end.



The deck would not embed in a way that looked nice, so this post has the content from the deck in simple outline form:

Top Down DevOps Benefits

  1. Reduce time to market - realize value faster, less coordination costs
  2. Hypothesis test value and growth assumptions using actual customer behavior, improve feature ROI
  3. Reduce/eliminate production incidents from changes, faster mean time to recover (MTTR):  avoid costs, build customer goodwill
  4. Reduce/repurpose operating costs- automate tasks across development, infrastructure and operations

 DevOps Levers

  1. Establish a Lean, self-improving culture
  2. Loosely coupled, intrinsically testable architectures, applications and services that are easy to change and scale
  3. Automated, self-service build pipelines of production like environments reduce environmental errors/defects
  4. Automated test suites with high test coverage to reduce defects
  5. Automated deployments to production reducing opportunities for error
  6. Enhanced telemetry tied to change events and self-service reporting reveal cause and effect thus improve MTTR
  7. Release automation and controls, e.g. feature toggles, canary, blue-green, enabling flexible releases (a minimal toggle sketch follows this list)
  8. Split testing capabilities enabling customer hypothesis testing
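To ground lever 7, here is a minimal feature-toggle sketch; the flag names, rollout percentages, and in-memory flag store are hypothetical stand-ins for a real configuration service.

```python
# A minimal feature-toggle sketch. Code ships to production "dark";
# the release itself is just a flag change, enabling canary-style rollout.
import zlib

FLAGS = {
    "new_checkout": {"enabled": True, "rollout_pct": 10},  # canary: 10% of users
    "new_search":   {"enabled": False, "rollout_pct": 0},  # deployed dark
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if this user falls inside the flag's rollout bucket."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Deterministic per-user bucketing keeps each user's experience consistent.
    return zlib.crc32(user_id.encode()) % 100 < flag["rollout_pct"]

if is_enabled("new_checkout", "user-123"):
    print("new checkout flow")
else:
    print("existing checkout flow")
```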

 DevOps Enablers 

Organizational models

  • Market-based (value stream aligned) integrated teams (infra, ops, dev, test, product)
  • Functional (capability aligned, e.g. dev, ops, test, platform) deeply integrated lean process e.g. test first, ops requirements defined up front, phased production transition to run, platform team owns shared self-service capabilities
  • Matrix - mix of above two

Practices

  • TDD/ATDD (XP)
  • Scrum/Kanban
  • Lean analysis, e.g. A3, Value Stream Analysis, Five Whys
  • Statistical analysis of operations
  • ...

Technologies

  • Cloud
  • Infrastructure as code
  • Continuous Integration
  • Deployment and release automation
  • Cluster management (e.g. improve utilization via oversubscription) 
  • ...

Three Top Down Metrics

Focus the whole organization on a shared goal; improve:
  • Deploys/day
  • Prod issues/month
  • MTTR
These metrics create a punch list of inefficiencies in the system to attack via solid monitoring, test coverage, sustainable architectures etc.

Map the value stream, identify and subordinate constraints...

--------------------------------------------------------------------------------------------------------------------------
If you'd like to see the deck it's here:  DevOps Business Case Chat (looks nicer - I like Google Slides' new "explore" formatting feature)...

The DevOps Handbook

While we didn't call it DevOps back in 2012 when I joined Gap Inc., we were working towards some of its goals even then. Over the ensuing years we got reasonably good at it. I read Gene Kim's book The Phoenix Project a couple of years back, so I was excited to learn he was working on a new book entitled The DevOps Handbook.

I pre-ordered the book on Amazon and wrote a four-part summary as I read it. This is a must-read book if you're in IT today. The book is long, so if you are in a hurry and want my take on the most important concepts, the summary is for you. It also serves as a good review if you've read the book and just want to keep it in your head.

Tuesday, November 15, 2016

2016 DevOps Vendor Landscape

I've been curious about the various DevOps tool vendors beyond those my teams have used, so here is some research that I'd like to share. We'll start with the 2016 Gartner Magic Quadrant report and then add some to it.  There's a link to my Google Sheet with quick vendor comparisons and a link to the Gartner report too.

Gartner 2016 Magic Quadrant for Application Release Automation:


For the most part the Leaders quadrant is dominated by the traditional players - IBM, Automic, CA Technologies - each with >30 years' experience. They've done a mix of building and acquiring capabilities. Two younger players, Electric Cloud and XebiaLabs, are both venture-backed, private firms.

My initial preferences in the Leaders quadrant are IBM and XebiaLabs.  They are both pretty end-to-end.  IBM's stateless architectural approach and XebiaLabs' somewhat unique agentless architecture make them jump out for me.

The Visionaries quadrant's sole member is Clarive. However, their implementation seemed convoluted. HashiCorp is potentially a visionary, though. While they are open source and thus not for everyone, their architectural vision resonates with me.  Their immutable-infrastructure philosophy and their mix of agent-based and agentless capabilities stand out. I couldn't figure out their workflow paradigm in the time I spent looking.  They are too small to make the list at this point, but they are worth looking at, especially if you're comfortable in the open source world. Studying them may highlight relevant problems everyone is solving that are not often discussed in the big players' marketing collateral.

Serena Software is the sole player in the Challengers quadrant.  While their offering is broad, it's also really heavy. They focus on highly regulated industries.

The stand out in the Niche quadrant for me is Puppet. Their test centric philosophy bodes well for them as they build out their offerings.

Atlassian - also not on the list - deserves special mention. If you're interested in piecing together your own solution and you're a development-centric organization, they may be a good fit. They also provide an opportunity to call out the smaller specialty players like Ansible, Rundeck, Chef, etc.  There's a nice blog entry written by one of their directors that is a good DevOps read.

Finally, a word about workflow.  My experience suggests we be wary of oversubscribing to the "drag and drop" philosophy.  XebiaLabs sums it up well in their 5 Reasons to Stay Away from Workflows technical note. Bottom line: DevOps at scale is complex, and dealing effectively with that complexity determines your level of success. Workflow can be leveraged as a tool to help visualize the broad strokes, but it becomes a spaghetti mess if you target "reuse" and assume "making it visible" allows people who don't understand software development to manage the complexity of DevOps implementations at scale.

Here's the Google Sheet with the high-level comparisons and some comments on all of these and a few more players: DevOps Players Nov 2016.

Here's the Gartner report too.

Enjoy!

Friday, October 14, 2016

Badass: Making Users Awesome

I learned of Badass: Making Users Awesome by Kathy Sierra after Bob Payne referred me to the Modern Agile movement. Kathy's book seeks to answer one question: Given competing equally priced, equally promoted products or services, why are some far more successful than others?

This 400-word summary will give you a sense of the book. If you create products for a living, then I highly recommend this book!

Friday, September 16, 2016

The Lean Start Up

The Lean Start Up: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries.

While working at Gap Inc., we frequently leveraged concepts from this book; I learned by doing. After leaving Gap I read the book and saw that we implemented some of the concepts well and others less so.

This is a great book worthy of study. I created a book summary using my Kindle highlights. The summary is almost entirely excerpts from the book; my goal was to create a reference or study guide for myself and to share with others. If you haven't read the book and don't have the time to read the whole thing (it takes maybe a few days), then this summary is worth your time. If you only want to review the key concepts, it is probably not for you.

If I had it to do over again I would have synthesized more and excerpted less.  Live and learn.



Friday, September 9, 2016

Agile Leadership Video Interview

My friend Bob Payne from LitheSpeed, who does the Agile Toolkit Podcast series, interviewed me yesterday. The topic was agile leadership, and he published the video interview on LitheSpeed's YouTube channel, so I thought I'd share. It was an engaging talk.

Friday, August 19, 2016

Lean IT Field Guide

A colleague from my Nationwide days, Tom Paider, co-authored a book with Michael Orzen called The Lean IT Field Guide: A Roadmap for Your Transformation. The book is based in part on our shared experience of establishing Nationwide's Application Development Center. More on that here.

His book is a how-to guide for applying certain aspects of Lean to corporate IT. The focus of the book is summarized below:

The modern enterprise has a simple goal: create sustainable value, i.e. make money now and in the long run.  A principal challenge is leading and managing a very large number of people to achieve that goal. Most would agree that doing so requires:

  1. Alignment from the top down on purpose and strategy
  2. Effective bottom up execution matched to the strategy
  3. Clear feedback enabling the adjustments needed to achieve the purpose

In brief, Lean enables people at every level to understand and improve the part they play in achieving the goal. Employees define baseline work processes called "standard work," along with their performance metrics, problems, and potential solutions, called "counter-measures." Collaborative cycles of Plan, Do, Check, Adjust apply the scientific method to solve these problems using a simple "A3" document. Leaders "Go See" on "Gemba Walks" to understand what's happening, to mentor, and to learn how to support their people. All of this is big and visible.  All of this happens at every level - not just the front line.

I created an excerpt based summary of the book in two parts. Part one starts here.