From an enterprise perspective, Big Data is disrupting the established thought process and status quo in data warehousing, reporting and analytics. Most enterprises see Big Data and Data Science as related, if not altogether the same thing. While the two mean different things, this assumption highlights how much clarification and explanation still has to take place at most enterprises. Our goal in the coming months is to shed light on these concepts and address the value proposition they represent in both enterprise and technical scenarios.
The first series of articles will address some of the algorithms that can be applied to the data gathered by different types of companies. The articles will also explain the value of the results these algorithms produce and the steps required to create them. As this exploration will show, choosing the right algorithm is rarely a black-and-white decision. Often the effort involves exploring paths that ultimately bear no direct fruit; even so, every attempt teaches something about the question being asked or the data being explored. We will also explain how building a workflow and process around this exploration delivers value right from its inception. That value proposition is what will drive a business to make the needed investment, and what will produce a rewarding return on it.
A typical data science project begins with the data at hand, characterized by its variety and veracity. Questions arise about the organization's future path, such as launching a product, introducing a service, or merging open and closed source projects, and new metadata must be presented to the decision makers. A data science project is initiated by identifying those decision makers' key decision drivers. These drivers might be greater personalization and computational efficiency in advertising and marketing, monitoring social media for the impact of the organization's products and services, image processing of accounts-receivable checks, or automated analysis of important documents.
Data Science is an area that explores complex processes involving both variables and data, processes often described as complicated tasks built on complex algorithms. It is also one of the costlier endeavors an organization can undertake: bringing in all stakeholders, sitting through the problem-definition phase, reconciling terminologies, and then selecting algorithms. The central concern is applying the selected algorithm(s) to the test data sets, determining the metadata required to run them, and establishing an estimated timeline that minimizes cost in terms of time, resources (data lake serving containers, data throughput) and the assignment of Data Engineers and Data Scientists. These exercises can impose such a burden that "no single algorithm selected" is often the conclusion of the initial process. Once a task is assigned to a Data Science team, the expectations set for an algorithm implementation sometimes fail to produce favorable results. This happens when requirements are created without being fully understood before the task begins, or when users are educated on the requirements only after seeing an initial version of the Data Science solution. The Data Scientists' work becomes more complicated still if the requirements change while the task is in progress, or if the chosen methodologies make implementing best practices impossible. A proof of concept should therefore be defined within Data Science based on the available requirements from the requesting stakeholders, whether they are customers or internal or external users.
A proof of concept in Data Science is a test of algorithms on a smaller set of data, carried out by building a prototype architected from detailed process design documents. It is the first milestone in the implementation process: it clarifies the requirements and prototypes the configuration. The proof of concept must produce an initial resultant data set to confirm that the approach will work in the proposed environment with larger data sets and function the way the customer requested. The outputs from a proof of concept include:
- Design documents that detail current business and best practice processes
- In-depth tests and analysis of the capabilities of the algorithms
- Communication and training on the algorithms within Machine Learning environments, to facilitate end-user adoption
- Architecture documents that specify data mapping and conversions
- Any interfaces to third-party systems, such as Hadoop and Spark
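To make the proof-of-concept idea concrete, here is a minimal sketch of the pattern described above: run a candidate algorithm on a small sample of the data before committing to the full set. The algorithm shown is K-Means clustering (covered later in this series); the data, sample size, and parameters are all hypothetical, purely for illustration, and the implementation is a bare-bones NumPy version rather than a production library.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """A minimal K-Means: assign points to the nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid; label = nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Hypothetical "full" data set: two synthetic 2-D blobs of 500 points each.
rng = np.random.default_rng(42)
full = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(500, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(500, 2)),
])

# Proof of concept: exercise the algorithm on a 10% sample first,
# before paying for a run over the full data set.
sample = full[rng.choice(len(full), size=100, replace=False)]
centroids, labels = kmeans(sample, k=2)
print(centroids)
```

The same pattern scales up: once the sampled run behaves as requested, the identical code path is pointed at the full data set, which is exactly the "initial resultant data set" milestone described above.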
A process this complex requires training, study and experience. We will present these ideas in a series of blog articles on Big Data and Data Science. The series will cover Big Data/NoSQL architectures, as well as the application of Data Science through Machine Learning and through Descriptive, Predictive and Prescriptive Analytics, which together serve as foundational analytics. These articles should help readers understand Data Science concepts and applications for products, services, and processes both internal and external to an organization as it embraces Big Data and Data Science.
Watch this space for a weekly update on Data Science.
- A Business Introduction to C4.5
- Elasticsearch – Your Own Private Google
- Text Analytics
- What You Should Know About K-Means Clustering