Making !T Work

Data → Information → Knowledge → Wisdom

The four Vs of Big Data

Like most things I am faced with, I like to distill reality into as few workable blocks as possible.   Some people may challenge me on trying to oversimplify the complex; however, I take the approach that complexity can be distilled into many simple elements.   By doing so, the solution becomes easier to manage.   It also makes it much simpler to communicate to all the stakeholders.   Engineering diagrams and architecture documents are a must for many team members, but not necessarily for all the stakeholders.

Recently, I have witnessed many conversations regarding the processing, analysing and dashboarding of what is being called real-time data.  I even struggle with that term, for it is really near-time operational data.   That is not relevant for this discussion.   What is relevant is this: how do you design, build, test and implement for a complex data stream while at the same time presenting it graphically?

This is the crux of this simple discussion.   Before you create your first draft of the schema or the architecture, you must put into perspective this part of the problem you are being asked to solve.   You must deal with the three classic “V”s of data, which are volume, velocity, and variety, plus a fourth, veracity, which comes later in this piece.

Big data simplified

The Four “V”s of Big Data

Volume

How much data are you going to get?  How wide and how deep the data is become very critical elements.   This will impact data storage and the type of storage.   Do you need SSD drives, or will large-capacity traditional drives work?   This becomes critical, for if you do not understand the volume of data to be dealt with, how will you be able to store it and retrieve it?
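
As a quick illustration, a back-of-the-envelope estimate like the sketch below is often enough to guide the storage decision.   The inbound rate, record width, retention period, and index overhead are purely illustrative assumptions; substitute your own figures.

```python
# Back-of-the-envelope storage estimate for an inbound data stream.
# Every figure below is an assumption for illustration, not a measurement.

records_per_second = 500        # assumed average inbound rate
avg_record_bytes   = 2_048      # assumed width of one record (~2 KB)
retention_days     = 365        # how long the raw data must be kept
index_overhead     = 0.30       # assumed 30% extra for indexes and logs

raw_per_day = records_per_second * 86_400 * avg_record_bytes
total_bytes = raw_per_day * retention_days * (1 + index_overhead)

print(f"~{raw_per_day / 1e9:.1f} GB/day, "
      f"~{total_bytes / 1e12:.1f} TB over {retention_days} days")
```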

Along with this, you need to understand how the information will be presented.   The goal here is to know what the data is going to be used for.   This will impact indexes, schemas and normalization levels.   Too many times this is overlooked, and organizations are then forced to throw more hardware at the problem when a simpler, less elegant data schema could have solved it.   Do not let perfection get in the way of good.   Good, in this context, is giving the users what they asked for instead of the perfectly engineered lab experiment.

A pure data model may contain too many narrow tables, and a pure business object model may contain too many unnecessary related tables.   Never lose sight of the problem you are trying to solve.

Depending on how the data is coming to you, you may want to look at changing the protocols, if possible.  Changing the inbound format from XML to JSON, for example, can drop the payload volume by up to 40%.   Other simple techniques, such as shortening tag label names, can have profound impacts on the volume of data.   Bigger, faster pipes work, but they carry expensive ongoing costs when a simpler transfer format could have been implemented.
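
A minimal sketch of that idea is below.   The sensor record and its field names are hypothetical; the exact saving depends entirely on how verbose the original tags are and how much the labels are shortened.

```python
# Compare the same hypothetical reading serialized as XML and as JSON.
# Verbose tag labels are paid for twice in XML (open and close tags).
import json

xml_payload = (
    "<sensorReading>"
    "<deviceIdentifier>PUMP-0042</deviceIdentifier>"
    "<temperatureCelsius>71.3</temperatureCelsius>"
    "<recordedTimestamp>2019-06-01T14:05:00Z</recordedTimestamp>"
    "</sensorReading>"
)

json_payload = json.dumps({
    "id": "PUMP-0042",            # shortened from deviceIdentifier
    "tempC": 71.3,                # shortened from temperatureCelsius
    "ts": "2019-06-01T14:05:00Z", # shortened from recordedTimestamp
})

saving = 1 - len(json_payload) / len(xml_payload)
print(f"XML: {len(xml_payload)} bytes, JSON: {len(json_payload)} bytes, "
      f"about {saving:.0%} smaller")
```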

Velocity

The question here is: will the data be coming at you like a garden hose or a fire hydrant?   When a problem arises in production, I always look to see if the speed of the data is causing the issue.   This typically has impacts in a few areas.   One impact area is whether the inbound process is architected to consume the data quickly.   Is the pipe big enough to handle the load?   When the data is so large that a parallel processing approach is needed, I tend to lean towards a queuing mechanism.   Others may call it buffering, but the end result is the same: take the massive inbound load, dump it into a working area, and then apply the complex Extraction, Transformation and Load (ETL) process against this working set.   This allows for simple resets and is an excellent way of keeping the network data pipes uncluttered.
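
A minimal sketch of that “land it first, transform it later” pattern follows, using SQLite as a stand-in for the working area.   The table and column names are hypothetical.

```python
# Fast path lands raw payloads in a staging table; a separate ETL pass
# works through whatever has accumulated. Names here are illustrative.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging (received_at TEXT, raw_payload TEXT)")
db.execute("CREATE TABLE readings (device_id TEXT, temp_c REAL)")

def ingest(payload: str, received_at: str) -> None:
    """Fast path: just land the raw payload, no transformation."""
    db.execute("INSERT INTO staging VALUES (?, ?)", (received_at, payload))

def run_etl_batch() -> None:
    """Slow path: transform the working set at its own pace."""
    for received_at, raw in db.execute("SELECT * FROM staging").fetchall():
        record = json.loads(raw)                     # extraction
        temp_c = float(record["tempC"])              # transformation
        db.execute("INSERT INTO readings VALUES (?, ?)",
                   (record["id"], temp_c))           # load
    db.execute("DELETE FROM staging")                # simple reset

ingest('{"id": "PUMP-0042", "tempC": 71.3}', "2019-06-01T14:05:00Z")
run_etl_batch()
print(db.execute("SELECT * FROM readings").fetchall())
```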

If the speed is slower but the volume is higher, as in a complex transactional set, then you can optimize the ETL process to make light work of the slower, larger volume.  However, if it is both high volume and high velocity, then a parallel inbound process feeding a common working queue typically gives the most reliable and flexible approach.   This approach also allows the abstraction of functions to be clearly defined and managed.
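
The sketch below shows the shape of that approach: several parallel inbound handlers do nothing but parse and enqueue, while a single worker applies the (potentially complex) transformation.   The thread counts, sources, and the toy transform are assumptions for illustration.

```python
# Parallel inbound handlers feed one common working queue; one ETL worker
# drains it. The transformation here is a trivial stand-in.
import queue
import threading

work_queue = queue.Queue()

def inbound_handler(source: str, messages: list) -> None:
    """One of several parallel receivers; it only parses and enqueues."""
    for msg in messages:
        work_queue.put({"source": source, "payload": msg})

def etl_worker() -> None:
    """The single place where the transformation logic lives."""
    while True:
        item = work_queue.get()
        if item is None:                          # sentinel: no more work
            break
        transformed = item["payload"].upper()     # stand-in for real ETL
        print(item["source"], transformed)

producers = [
    threading.Thread(target=inbound_handler, args=(f"feed-{i}", ["a", "b"]))
    for i in range(3)
]
consumer = threading.Thread(target=etl_worker)
consumer.start()
for p in producers:
    p.start()
for p in producers:
    p.join()
work_queue.put(None)                              # stop the worker
consumer.join()
```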

Variety

How complex and varied is the actual data?  This is where the reality of ETL truly comes into play.  Is the data a simple, narrow collection of a few elements, or is it a deeply layered parent/child transactional dataset?  How much transformation is needed, and how many business rules are needed to make sense of the data and create an information set?   When I get called in to address these types of engagements, they can take some time to optimize if variety was not thoughtfully considered at the beginning.   In most cases the database teams had drifted toward easy approaches instead of more holistic ones.   There is no right or wrong answer here, for the overall picture has to be considered.

Use creative, pragmatic approaches.  Do you create functions, do you use complex joins, do you use indexes, do you use views, and how are triggers being impacted?   These are the types of questions that need to be addressed.   However, one of the bigger questions should be: how dynamic are the business rules?   If the business rules change frequently, then you need an approach that allows the database to control the rules, NOT developers controlling the execution of the rules.   If you do use the database to control the rules, it will typically affect performance negatively.   Like I said, it is a balancing act.   Do not forget to ask whether the data can be cleaned at the source.
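
One way to picture database-controlled rules is sketched below: the rule definitions live in a table, so changing a threshold is a data update rather than a code deployment.   The rules table, its columns, and the rule format are hypothetical, and the per-record lookup is exactly the performance cost mentioned above.

```python
# Business rules stored as data and evaluated at ETL time.
# Table layout and rule format are illustrative assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE business_rules "
           "(name TEXT, field TEXT, op TEXT, threshold REAL, action TEXT)")
db.execute("INSERT INTO business_rules VALUES "
           "('overheat', 'temp_c', '>', 80.0, 'flag')")

OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b}

def apply_rules(record: dict) -> list:
    """Evaluate every stored rule against one record."""
    actions = []
    for name, field, op, threshold, action in db.execute(
            "SELECT name, field, op, threshold, action FROM business_rules"):
        if OPS[op](record[field], threshold):
            actions.append((name, action))
    return actions

print(apply_rules({"temp_c": 91.2}))   # [('overheat', 'flag')]
```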

Veracity

This fourth V refers to the accuracy and relevance of the data.   A close friend of mine reminded me that there are four Vs to big data, and upon reflection he was correct.   This step is commonly addressed in the cleansing of the source data.   The most important part is to ask the tough question of relevance.   Is the data relevant?   What is the narrative of the information, and how is it conveyed to its consumers?   Everyone must be vigilant (look, another V) as to the context of the content.
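
A minimal cleansing pass might look like the sketch below: reject records that are incomplete, implausible, or too stale to still be relevant.   The field names, plausibility range, and freshness window are assumptions for illustration.

```python
# Simple veracity filter: accuracy (complete, plausible values) plus
# relevance (recent enough to act on). All limits are illustrative.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=1)        # assumed relevance window

def is_veracious(record: dict, now: datetime) -> bool:
    if not record.get("id"):                            # must identify source
        return False
    temp = record.get("tempC")
    if temp is None or not (-40.0 <= temp <= 150.0):    # plausible range
        return False
    ts = record.get("ts")
    if ts is None:
        return False
    return now - datetime.fromisoformat(ts) <= MAX_AGE  # still current

now = datetime(2019, 6, 1, 14, 30, tzinfo=timezone.utc)
fresh = {"id": "PUMP-0042", "tempC": 71.3, "ts": "2019-06-01T14:05:00+00:00"}
stale = {"id": "PUMP-0042", "tempC": 71.3, "ts": "2019-06-01T08:00:00+00:00"}
print(is_veracious(fresh, now), is_veracious(stale, now))   # True False
```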

Before you embark on the journey, make sure the story it will tell is the story that is relevant to the people reading it.   Strategic data should be presented strategically, tactical information should be actionable within workable horizons, and operational data must be relevant, current, and actionable.   Operational groups do not care about six-month horizons of the past or long-term predictions of the future.   They need actionable, current and contextual information about the now.

Taking the time to understand that narrative before you commence the big data journey is very important.

In conclusion

The good news is that, in many cases, a thoughtful up-front design that is respectful of the variety, while understanding the volume and velocity, can be achieved with much less effort than trying to deal with these issues in production.

Big data does not have to be scary, for big ideas are just a collection of smaller ideas.


Copyright © 2019 JAAT Solutions. All Rights Reserved.