Book Summary: Practical DataOps by Harinder Atwal
Link to the book: https://link.springer.com/book/10.1007/978-1-4842-5104-1
Overview
In a nutshell, this book serves as an excellent introduction to the DataOps approach.
Chapter 1: Data Problem
The main reasons many organizations do not get the return they expect from their data science investments are knowledge gaps, outdated approaches to data management and analytics production, and a lack of support for data analytics within the organization.
Different types of knowledge gaps:
- The Data Scientist Knowledge Gap
- IT Knowledge Gap
- Technology Knowledge Gap
- Leadership Knowledge Gap
- Data-Literacy Gap
Chapter 2: Data Strategy
Question: Why do we need a new data strategy?
Answer: Data is no longer IT, so old IT approaches are obsolete.
Key components of a data strategy:
- The Organization
- People
- Technology
- Processes
- Data Asset
We should take into consideration:
- Missions
- Visions
- KPIs
Approach to define a data strategy:
- Define what needs to change.
- Benchmark existing approaches.
- Define strategy objectives.
- Create a measurement plan.
Chapter 3: Lean Thinking
Lean thinking transformed the global manufacturing sector and has since reshaped other industries and practices, including air transportation, software development, and the way startups are built.
Problem
For a production system or product development system to be useful, we need to understand what our customers value, so that time and effort are not wasted building something they do not want.
Unfortunately, when it comes to data analytics and data science, our customers are frequently unsure of what they require, or, in the case of data engineering, which solution is best.
The Lean startup methodology
The Lean startup methodology addresses this uncertainty by first creating a minimum viable product (MVP) and then iteratively improving it based on validated learning and feedback. We can be more successful if we can learn faster. As a result, increasing delivery velocity in a sustainable way should be a top priority, if not an obsession.
Lean Software Development
Mary and Tom Poppendieck developed seven principles of Lean software development:
- Eliminate waste
- Build quality in
- Create knowledge
- Defer commitment
- Deliver fast
- Respect people
- Optimize the whole
Chapter 4: Agile Collaboration
Agile Manifesto
The Agile Manifesto sets out four core values, based on its authors' experience of successful teams and projects:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
Agile’s 12 principles
In addition to the four values in the agile manifesto, the 17 original signers also came up with 12 principles to guide practitioners implementing and executing with agility:
- Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
- Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.
- Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.
- Business people and developers must work together daily throughout the project.
- Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.
- The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
- Working software is the primary measure of progress.
- Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
- Continuous attention to technical excellence and good design enhances agility.
- Simplicity — the art of maximizing the amount of work not done — is essential.
- The best architectures, requirements, and designs emerge from self-organizing teams.
- At regular intervals, the team reflects on how to become more effective and then tunes and adjusts its behavior accordingly.
DataOps Manifesto
Whether referred to as data science, data engineering, data management, big data, business intelligence, or the like, through our work we have come to value in analytics:
- Individuals and interactions over processes and tools
- Working analytics over comprehensive documentation
- Customer collaboration over contract negotiation
- Experimentation, iteration, and feedback over extensive upfront design
- Cross-functional ownership of operations over siloed responsibilities
DataOps Principles
The DataOps manifesto also lists 18 principles:
- Continually satisfy your customer. Our highest priority is to satisfy the customer through the early and continuous delivery of valuable analytic insights from a couple of minutes to weeks.
- Value working analytics. We believe the primary measure of data analytics performance is the degree to which insightful analytics are delivered, incorporating accurate data, atop robust frameworks and systems.
- Embrace change. We welcome evolving customer needs, and in fact, we embrace them to generate competitive advantage. We believe that the most efficient, effective, and agile method of communication with customers is face-to-face conversation.
- It’s a team sport. Analytic teams will always have a variety of roles, skills, favorite tools, and titles.
- Daily interactions. Customers, analytic teams, and operations must work together daily throughout the project.
- Self-organize. We believe that the best analytic insight, algorithms, architectures, requirements, and designs emerge from self-organizing teams.
- Reduce heroism. As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
- Reflect. Analytic teams should fine-tune their operational performance by self-reflecting, at regular intervals, on feedback provided by their customers, themselves, and operational statistics.
- Analytics is code. Analytic teams use a variety of individual tools to access, integrate, model, and visualize data. Fundamentally, each of these tools generates code and configuration which describes the actions taken upon data to deliver insight.
- Orchestrate. The beginning-to-end orchestration of data, tools, code, environments, and the analytic teams' work is a key driver of analytic success.
- Make it reproducible. Reproducible results are required, and therefore, we version everything: data, low-level hardware and software configurations, and the code and configuration specific to each tool in the toolchain.
- Disposable environments. We believe it is important to minimize the cost for analytic team members to experiment by giving them easy to create, isolated, safe, and disposable technical environments that reflect their production environment.
- Simplicity. We believe that continuous attention to technical excellence and good design enhances agility; likewise simplicity — the art of maximizing the amount of work not done — is essential.
- Analytics is manufacturing. Analytic pipelines are analogous to Lean manufacturing lines. We believe a fundamental concept of DataOps is a focus on process thinking aimed at achieving continuous efficiencies in the manufacture of analytic insight.
- Quality is paramount. Analytic pipelines should be built with a foundation capable of automated detection of abnormalities (jidoka) and security issues in code, configuration, and data and should provide continuous feedback to operators for error avoidance (poka yoke).
- Monitor quality and performance. Our goal is to have performance, security, and quality measures that are monitored continuously to detect unexpected variation and generate operational statistics.
- Reuse. We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the repetition of previous work by the individual or team.
- Improve cycle times. We should strive to minimize the time and effort to turn a customer need into an analytic idea, create it in development, release it as a repeatable production process, and finally refactor and reuse that product.
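The "Quality is paramount" and "Monitor quality and performance" principles above can be illustrated with a small sketch. It is not taken from the book or the manifesto: the names `control_limits`, `check_batch`, and `QualityGateError`, the three-sigma rule, and the sample row counts are all assumptions chosen for illustration. The idea is that a pipeline step compares a new measurement with limits derived from history and stops, rather than passing a suspect batch downstream.

```python
from statistics import mean, stdev


class QualityGateError(Exception):
    """Raised to stop the pipeline when a batch looks abnormal (jidoka-style)."""


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits derived from past measurements.

    Assumption: a simple mean +/- N sample standard deviations rule.
    Real pipelines may need something more robust.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - sigmas * spread, centre + sigmas * spread


def check_batch(row_count, history, sigmas=3.0):
    """Raise QualityGateError if row_count falls outside the control limits."""
    lower, upper = control_limits(history, sigmas)
    if not lower <= row_count <= upper:
        raise QualityGateError(
            f"row count {row_count} outside [{lower:.0f}, {upper:.0f}]"
        )
    return True


# Hypothetical daily row counts for one table.
past_counts = [1000, 1020, 980, 1010, 990]

check_batch(1005, past_counts)       # within limits, the pipeline continues
try:
    check_batch(400, past_counts)    # far too few rows, the pipeline stops
except QualityGateError as err:
    print(f"halting: {err}")
```

The same limits can double as an operational statistic: logging each measurement together with its limits gives the continuous monitoring that the principles call for.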
Chapter 5: Build Feedback and Measurement
Systems Thinking
Linear thinking (thinking in simple cause-and-effect terms) is very effective at solving simple problems. However, the world mostly consists of complex interrelationships between people and objects, and it is hard to predict the impact of a change because of unanticipated side effects and feedback loops.
The world is an example of a system, which is a set of components that connect to form something more complicated. An organization is a system whose components include teams, hierarchical structures, technologies, processes, policies, customers, data, incentives, suppliers, and conventions. Data analytics is also a system, one that interacts with and supports other systems in an organization.
Chapter 6: Building Trust
A workflow to deliver trust in data
Chapter 7: DevOps for DataOps
The Conflict
Most organizations still struggle to deploy software changes to production every week or month, let alone hundreds of times a day. Often, those production deployments are high-stress affairs involving outages, firefighting, rollbacks, and occasionally much worse.
In August 2012, a trading algorithm deployment error by the financial services firm Knight Capital led to a $440 million loss in the 45 minutes it took to fix the problem.
The cost of the conflict is not just economic. For the employees involved, it creates stress and a lower quality of life through working evenings and weekends to keep the ship afloat.
Breaking the Spiral
DevOps breaks the death spiral caused by the conflict between development and IT Operations. DevOps practices seek to achieve the aims of multiple IT functions (development, QA, security, and IT Operations) while improving the organization’s performance.
Reproducible Environments
There are multiple approaches and tools for automatically building environments that are often used in combination with each other.
- Configuration orchestration.
- Operating system configuration.
- Virtual machine (VM) environments.
- Application container platforms.
- Configuration management.
- Package management.
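As one small illustration of the package management item above, the sketch below checks whether an environment still matches a set of pinned package versions. It is an assumption-laden example rather than anything prescribed by the book: `installed_version` and `find_drift` are hypothetical helper names, and the pinned versions are made up. Only `importlib.metadata.version` and `PackageNotFoundError` come from the Python standard library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pinned, lookup=installed_version):
    """Return {package: (expected, actual)} for every version mismatch.

    `lookup` is injectable so the check can be exercised without touching
    the real environment.
    """
    drift = {}
    for package, expected in pinned.items():
        actual = lookup(package)
        if actual != expected:
            drift[package] = (expected, actual)
    return drift


# Hypothetical pins and a fake "installed" environment for demonstration.
pinned = {"pandas": "2.1.0", "numpy": "1.25.0", "requests": "2.31.0"}
fake_environment = {"pandas": "2.1.0", "numpy": "1.26.0"}

drift = find_drift(pinned, lookup=fake_environment.get)
for package, (expected, actual) in drift.items():
    print(f"{package}: expected {expected}, found {actual}")
```

Calling `find_drift(pinned)` without the `lookup` argument checks the real interpreter environment instead of the fake one.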
DevOps Measurement
Measurement and feedback are necessary to understand whether applications and systems are running as expected and whether goals are being achieved, and to fix problems and innovate fast. At the deployment pipeline level, example metrics include deployment frequency, change lead time (the time it takes for a new feature or bug fix to go from inception to production), failure rate for production deployments, and mean time to recovery (MTTR), which is the average time to recover from a production failure.
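The four deployment metrics can be computed from a simple deployment log, as in the sketch below. The `Deployment` record, the function names, and the sample data are hypothetical. Commit time is also used as a stand-in for "inception" when measuring lead time, which is an assumption, since the true starting point depends on how an organization tracks work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # proxy for "inception" (assumption)
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None   # set only for failed deployments


def deployment_frequency(deployments, period_days):
    """Average number of deployments per day over the period."""
    return len(deployments) / period_days


def mean_lead_time(deployments):
    """Average time from commit to production."""
    deltas = [d.deployed_at - d.committed_at for d in deployments]
    return sum(deltas, timedelta()) / len(deltas)


def failure_rate(deployments):
    """Share of production deployments that failed."""
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_recovery(deployments):
    """Average time from a failed deployment to restored service."""
    deltas = [d.restored_at - d.deployed_at for d in deployments if d.failed]
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical one-week deployment log.
log = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9),
               failed=True, restored_at=datetime(2024, 1, 3, 10)),
    Deployment(datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 16)),
    Deployment(datetime(2024, 1, 6, 8), datetime(2024, 1, 6, 20),
               failed=True, restored_at=datetime(2024, 1, 6, 23)),
]

print("deployments per day:", deployment_frequency(log, period_days=7))
print("mean lead time:", mean_lead_time(log))
print("failure rate:", failure_rate(log))
print("MTTR:", mean_time_to_recovery(log))
```

Tracking these values over time, rather than as one-off numbers, is what turns them into feedback.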
Chapter 8: Organizing for DataOps
Team Structure
- Function-Orientated Teams
- Domain-Orientated Teams
The New Skills Matrix
- Core Personas
The primary personas are data platform administrator, data analyst, data scientist, data engineer, DataOps engineer, team lead, solutions expert, and organizational stakeholder.
- Supporting Personas
Typical supporting personas include data product owners, domain experts, analytics specialists (such as researchers or specialized data scientists), and technical specialists (e.g., data architects, software engineers, ML engineers, security experts, testers, designers).
There Is No I in Team
- I-shaped people are specialists with narrow but deep domain skills in one area. Specialists create silos within teams. Although the work they do may only take hours or days to complete, there can be much longer wait times before they are available to work on something new.
- T-shaped people (also known as generalizing specialists in Agile terminology) have deep expertise in one area, such as data engineering, but they also tend to have broad skills across many fields, such as machine learning and data visualization.
- Pi-shaped people have a wide breadth of knowledge of many disciplines and depth of skills in two areas.
- M-shaped people are poly-skilled; they combine the breadth of knowledge of T-shaped individuals with deep knowledge of three or more specialisms.
- According to the book, Pi- and M-shaped people can increase team productivity by orders of magnitude through an increased flow of work and the ability to cross-train and grow others.
- E-shaped people have a combination of four Es: experience, expertise, exploration, and execution.
Chapter 9: DataOps Technology
The DataOps Technology Ecosystem
- The Assembly Line
- Data Integration
- Data Preparation
- Stream Processing
- Data Management
- Reproducibility, Deployment, Orchestration, and Monitoring
- Compute Infrastructure and Query Execution Engines
- Data Storage
- DataOps Platforms
- Data Analytics Tools
DataOps Build vs. Buy
- Build In-House
- Buy or Rent an Off-the-Shelf Product
- Borrow Open Source
- Cloud Native Architecture
Chapter 10: The DataOps Factory
Considerations
Getting data science and analytics right can seem intimidating. There are many factors to consider.
Recruitment, people development, culture, organization, prioritization, collaboration, data acquisition and ingestion, data quality, processes, algorithms, technology, reproducibility, deployment, operations, monitoring, benefit measurement, governance, data security, and privacy must all be mastered.