What are the 5 V’s of big data?
The 5 V’s of big data — velocity, volume, value, variety and veracity — are the five primary and inherent characteristics of big data. Knowing the 5 V’s lets data scientists derive more value from their data, while also helping their organizations become more customer-centric.
Earlier this century, big data was discussed in terms of the three V’s — volume, velocity and variety. Over time, two more V’s — value and veracity — were added to help data scientists more effectively articulate and communicate the important characteristics of big data. In some cases, a sixth V — variability — is also included.
What is big data?
Big data is a combination of unstructured, semi-structured or structured data collected by organizations. These data sets can be mined for insights and used in machine learning projects, predictive modeling and other advanced analytics applications.
Big data can be used to improve operations, provide better customer service and create personalized marketing campaigns — all of which add value for an organization. As an example, big data analytics can provide companies with valuable insights into their customers that can then be used to refine marketing techniques to increase customer engagement and conversion rates.
Big data can be used in healthcare to identify risk factors for disease, or doctors can use it to help diagnose illnesses in patients. Energy industries can use big data to monitor electrical grids, perform risk management or analyze market data in real time.
Organizations that use big data have a potential competitive advantage over those that don’t, because they can make faster, more informed business decisions based on the insights the data provides.
What are the 5 V’s?
The 5 V’s are defined as follows:
- Velocity is the speed at which data is created and how fast it moves.
- Volume is the amount of data that qualifies as big data.
- Value is the benefit the data provides.
- Variety is the diversity that exists in the types of data.
- Veracity is the quality and accuracy of the data.
Velocity refers to how fast data is generated and how fast it moves. This is an important aspect for organizations that need their data to flow quickly, so it can be used at the right time to make the best business decisions possible.
An organization that uses big data has a large and continuous flow of data being created and sent to its final destination. Data can flow from sources such as machines, networks, smartphones or social media. Velocity applies to the speed at which this information arrives — for example, how many social media posts are ingested per day — as well as the speed at which it must be processed and analyzed, often quickly and sometimes in near real time.
As an example, in healthcare, many medical devices today are designed to monitor patients and collect data. From in-hospital medical equipment to wearable devices, collected data must be sent to its destination and analyzed quickly.
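As a minimal sketch of the velocity idea, readings can be analyzed as they arrive rather than in a later batch. The device, field names and alert threshold below are hypothetical:

```python
from datetime import datetime, timezone

def vitals_stream(n_readings):
    """Simulate a wearable device emitting heart-rate readings over time."""
    for i in range(n_readings):
        # Deterministic fake values standing in for real sensor output.
        yield {"timestamp": datetime.now(timezone.utc),
               "heart_rate": 60 + (i * 13) % 50}

def process(reading, alert_threshold=90):
    """Analyze each reading the moment it arrives, not in a later batch."""
    return "ALERT" if reading["heart_rate"] > alert_threshold else "OK"

# Consume the stream reading-by-reading, as a high-velocity pipeline would.
statuses = [process(r) for r in vitals_stream(5)]
```

The point of the sketch is the shape of the loop: analysis happens per reading, so an out-of-range value is flagged immediately instead of waiting for an end-of-day batch job.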
In some cases, however, it’s better to collect a limited set of data than to collect more data than an organization can handle, as the excess can lead to slower data speeds.
Volume refers to the amount of data available. Volume is the foundation of big data, as it is the initial size and amount of data collected. If the amount of data is large enough, it can be considered big data. However, what counts as big data is relative and will change depending on the computing power available on the market.
For example, a company that operates hundreds of stores across multiple states generates millions of transactions per day. This qualifies as big data, and the average number of total daily transactions across its stores represents its volume.
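Using hypothetical figures consistent with that example (400 stores, roughly 5,000 transactions per store per day), the back-of-the-envelope volume arithmetic looks like:

```python
# Hypothetical figures: estimate transaction volume across a retail chain.
stores = 400
avg_transactions_per_store_per_day = 5_000

daily_volume = stores * avg_transactions_per_store_per_day  # rows per day
yearly_volume = daily_volume * 365                          # rows per year
```

At two million rows per day the yearly total reaches the hundreds of millions, which is why volume alone can push a workload out of a single database and into big data tooling.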
Value refers to the benefits that big data can provide, and it is directly related to what organizations can do with the collected data. Being able to pull value from big data is a necessity, as the value of big data increases greatly depending on the insights that can be derived from it.
Organizations can use big data tools to gather and analyze data, but how they get value from that data must be unique to them. Tools like Apache Hadoop help organizations store, clean and rapidly process this massive amount of data.
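As an illustrative sketch (plain Python, not Hadoop itself), deriving value from collected data can be as simple as aggregating raw transactions into a per-customer insight. The records and field names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical transaction records; a real pipeline would read these from
# a distributed store such as HDFS rather than an in-memory list.
transactions = [
    {"customer": "alice", "amount": 120.0},
    {"customer": "bob",   "amount": 35.5},
    {"customer": "alice", "amount": 60.0},
]

# Derive a simple insight: total spend per customer, which marketing
# could use to identify and target high-value customers.
spend = defaultdict(float)
for t in transactions:
    spend[t["customer"]] += t["amount"]

top_customer = max(spend, key=spend.get)
```

The value is not in the raw rows but in the aggregate view: once spend per customer exists, it can drive segmentation, personalization and retention decisions.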
A good example of big data value can be found in the collection of individual customer data. If a company can profile its customers, it can personalize their marketing and sales experience, improving the efficiency of contacts and achieving greater customer satisfaction.
Variety refers to the diversity of data types. An organization may obtain data from several different data sources, which may vary in value. Data can come from sources both inside and outside a business. The challenge of variety lies in the standardization and distribution of all the data collected.
As mentioned above, the collected data may be unstructured, semi-structured or structured. Unstructured data is data that is not organized and comes in different files or formats. Generally, unstructured data is not ideal for a mainstream relational database because it does not fit conventional data models. Semi-structured data is data that has not been organized in a special repository but contains related information, such as metadata. This makes it easier to process than unstructured data. Structured data, on the other hand, is data organized into a formatted repository. This means that the data is made more responsive for effective data processing and analysis.
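The three categories can be illustrated with a small Python sketch; the example records and field names are hypothetical:

```python
import json

# Structured: rows conforming to a fixed schema, as in a relational table
# with columns (date, store_id, total).
structured = [("2024-01-05", "store_12", 199.99)]

# Semi-structured: no rigid schema, but self-describing through keys and
# metadata, as in a JSON document.
semi_structured = json.loads(
    '{"user": "jdoe", "tags": ["mobile", "promo"], "meta": {"source": "app"}}'
)

# Unstructured: free-form content with no inherent data model, such as a
# customer review or call center transcript.
unstructured = "Great service, but the checkout line was too long."
```

A relational database handles the first case naturally; the second can be queried through its keys; the third needs text-processing techniques before it fits any conventional model.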
Raw data also qualifies as a data type. While raw data may fall into the other categories — structured, semi-structured or unstructured — it is considered raw if it has not received any processing. Generally, raw data is imported from other organizations or submitted or entered by users. Social media data often falls into this category.
A more specific example can be found in a company that gathers various data about its customers. This may include structured data taken from transactions or unstructured social media posts and call center text. Much of this may come in the form of raw data, which requires cleaning before processing.
Veracity refers to the quality, accuracy, integrity and credibility of data. Collected data may have missing pieces, may be inaccurate or may fail to provide real, valuable insight. Veracity, overall, refers to the level of trust there is in the collected data.
Data can sometimes be messy and difficult to use. A large amount of data can cause more confusion than insight if it is incomplete. For example, in medicine, if the data about which medicines a patient is taking is incomplete, the patient’s life may be put in danger.
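A minimal veracity check along these lines might screen out records with missing required fields before they are used. The field names and records below are hypothetical:

```python
def is_trustworthy(record, required=("patient_id", "drug", "dose_mg")):
    """Flag records whose required fields are missing or empty."""
    return all(record.get(field) not in (None, "") for field in required)

records = [
    {"patient_id": "p1", "drug": "metformin", "dose_mg": 500},
    {"patient_id": "p2", "drug": "", "dose_mg": 20},   # missing drug name
    {"patient_id": "p3", "drug": "lisinopril"},        # missing dose
]

# Only complete records pass the veracity screen.
clean = [r for r in records if is_trustworthy(r)]
```

Real data-quality pipelines go much further (range checks, cross-source reconciliation, lineage tracking), but even a completeness gate like this prevents obviously untrustworthy records from reaching analysis.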
Value and veracity together help define the quality of data and the insights derived from it. Standards for data veracity should be — and typically are — set at an organization’s executive level, to determine whether the data is suitable for high-level decision-making.
Where might a red flag appear in data veracity? Data can, for example, lack proper data lineage — that is, a verifiable trace of its origins and movement.
The 6th V: Variability
The 5 V’s above cover a lot of ground and go a long way toward explaining the proper use of big data. But there’s another V that deserves serious consideration — variability — which doesn’t so much define big data as it emphasizes the need to manage it well.
Variability refers to inconsistencies in big data’s use or flow. In the former case, an organization may have more than one definition in use for particular data. For example, an insurance company may have one department that uses one set of risk criteria while another department uses a different set. In the latter case, data flowing through the company’s data stores in a decentralized manner — without a common point of entry or upfront validation — can find its way into different systems that alter it, resulting in conflicting sources of truth on the reporting side.
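The first kind of inconsistency, two departments applying different definitions to the same data, can be sketched as follows; the risk criteria and thresholds are hypothetical:

```python
# Hypothetical: two departments each define "high risk" differently,
# so the same applicant receives conflicting labels.
def underwriting_risk(applicant):
    # Underwriting flags anyone with 2 or more prior claims.
    return "high" if applicant["claims"] >= 2 else "low"

def fraud_risk(applicant):
    # Fraud review only flags 4 or more prior claims.
    return "high" if applicant["claims"] >= 4 else "low"

applicant = {"claims": 3}
labels = {underwriting_risk(applicant), fraud_risk(applicant)}
inconsistent = len(labels) > 1  # same data, conflicting answers
```

Detecting that the same record yields different answers under different definitions is exactly the kind of signal a variability audit looks for.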
Minimizing variability in big data requires careful curation of data flows as data moves through organizational systems, from transactional to analytical and everything in between. The biggest beneficiary is big data veracity, because consistency in how data is defined and used leads to stronger reporting and analytics and, therefore, higher trust.
Learn which factors to consider when choosing between a data lake and a data warehouse to store big data.