SRE – Introduction

SRE is what happens when you ask a software engineer to design an operations team
– Ben Treynor Sloss, Google

Site Reliability Engineering (SRE) has been among the most popular technology topics of the last few years, with the IT industry viewing it as a better way to run production systems: applying a software engineering mindset to work that would otherwise be performed, often manually, by sysadmins. The definition by the originator of the term, Ben Treynor Sloss at Google, gives an insight into the vision behind the concept – “SRE is what happens when you ask a software engineer to design an operations team”. As usually happens with any topic that becomes popular, numerous self-styled SRE experts have interpreted the concept in whatever way is most convenient for their needs. To avoid a biased understanding, I started learning about SRE by reading the book written by the creators of the concept at Google – Site Reliability Engineering: How Google Runs Production Systems.

Most misinterpretations about what an SRE team should do and who should be part of it will go away once one understands this statement from the book: SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

Google’s Approach to Service Management
  • Hire software engineers to run products and to create systems to accomplish the work that would otherwise be performed manually
  • Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload
  • 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc.
  • When an SRE team consistently spends less than 50% of its time on engineering work, shift some of the operations burden back to the development team, or add staff to the SRE team without assigning it additional operational responsibilities
  • Want systems that are automatic, not just automated
  • SRE vs. DevOps
    Before going further into SRE, let me compare it with DevOps, a similar concept that addresses friction between development and operations. The two are alike in bridging the gap between development and operations and in their massive focus on automation. In Google’s view, SRE is a specific implementation of DevOps with some idiosyncratic extensions. There are significant differences too: DevOps is a mindset focused on product development and delivery, while SRE is a set of practices focused on post-production reliability.

    | SRE | DevOps |
    | --- | --- |
    | Production | Removing silos, “big picture”, delivering applications |
    | Set of practices and metrics | Mindset and culture of collaboration |
    | System availability and reliability | Product development and delivery |
    | Systems engineers who write code | Everyone involved |
    | How it should be done | What needs to be done |

    SRE Responsibilities:

    An SRE team is typically responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services it supports. The core tenets of Google SRE are:

    • Ensuring a Durable Focus on Engineering
    • Pursuing Maximum Change Velocity Without Violating a Service’s SLO
    • Monitoring using automated software
    • Emergency Response designed to reduce Mean Time To Repair (MTTR)
    • Change Management that is automated to accomplish progressive rollouts, quickly detecting any problems and rolling back changes safely when problems arise
    • Demand Forecasting and Capacity Planning to ensure that the required capacity is in place by the time it is needed
    • Provisioning conducted quickly and only when necessary
    • Efficiency and Performance by predicting demand and provisioning capacity
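The SLO tenet above is usually made concrete through an error budget: the amount of unreliability a service is allowed before change velocity must slow down. A minimal sketch of the arithmetic (the function name and the 30-day window are my own illustration, not from the book):

```python
def error_budget_minutes(availability_slo: float, period_days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    total_minutes = period_days * 24 * 60
    return (1 - availability_slo) * total_minutes

# A 99.9% SLO over a 30-day window leaves roughly 43 minutes of downtime
# that releases, experiments and incidents may collectively "spend".
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

Once the budget is spent, risky changes are paused until reliability recovers – that is how maximum change velocity and the SLO stay in balance.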

    Many organizations embark on building an SRE team in addition to a dedicated multi-tiered Operations team supporting a service. Adding an SRE team as just another layer on top of the existing ones will only make the Operations process more inefficient. Being on-call is one of the integral functions of an SRE team, and transforming an existing L2 Support team to the SRE model will yield the best results. Instead of a “my environment is unique and SRE won’t work” attitude, it is important to revisit the entire Operations process holistically in light of SRE principles and practices. In the next two blog posts, I will cover key points on the principles and practices followed by Google, as described in the book.

    The 4 Disciplines of Execution

    I referred to 4DX in my previous blog, and the book that introduced this concept was my next read. Executing projects to successful completion has long been my strength, and after reading the book, I was delighted to find that I already follow most of its rules. The book helped me build my vocabulary around execution focus and articulate how one can succeed with excellent execution. In this blog post, I have summarized the key rules and principles behind 4DX.

    People working at large organizations will be familiar with the struggle to prioritize execution of important strategic goals, as they invariably end up spending most of their time on urgent day-to-day operational tasks. The real enemy of execution is thus our day job, which the book calls the whirlwind. 4DX acknowledges the importance of the whirlwind and provides a set of rules for executing our most critical strategy in its midst.

    1. Discipline #1: Focus on the Wildly Important – The big idea here is to focus our finest effort on a few highly important goals that can be achieved in the midst of the whirlwind of the day job, rather than giving mediocre effort to dozens of goals.
      1. Rule #1: No team focuses on more than two Wildly Important Goals (WIGs) at the same time.
      2. Rule #2: The battle you choose must win the war.
      3. Rule #3: Senior leaders can veto, but not dictate.
      4. Rule #4: All WIGs must have a finish line in the form of “from X to Y by when”.
    2. Discipline #2: Act on the Lead Measures: Discipline 1 takes the wildly important goal for an organization and breaks it into a set of specific, measurable targets until every team has a WIG that it can own. Discipline 2 then defines the leveraged actions that can enable the team to achieve that goal. Tracking a goal is done through two types of measures:
      • Lag Measures: Measurements of a result we are trying to achieve, called lag measures because by the time we get the data, the result has already happened – they are always lagging. Examples – sprint velocity, lead time, revenue, profits, etc.
      • Lead Measures: Foretell the result and are virtually within our control. Example – while a sprint goal (say velocity, a lag measure) can be jeopardized by external dependencies outside the team’s control, the team can certainly adhere strictly to acceptance criteria (lead measures like definition of ready and definition of done). The more the team acts on the lead measures, the more likely the sprint goals will be accomplished. Lead measures should have two primary characteristics:
        • Predictive: If the lead measure changes, team can predict that the lag measure will also change.
        • Influenceable: It can be directly influenced by the team without a significant dependence on another team.
    3. Discipline #3: Keep a Compelling Scoreboard – The third discipline is to make sure everyone on the team knows the score at all times, so that they can tell whether they are winning. A Sprint Burndown Chart that tracks the team’s progress towards the sprint goal is one example. The following four questions determine whether a scoreboard is likely to be compelling to the team:
      • Is it simple?
      • Can I see it easily?
      • Does it show lead and lag measures?
      • Can I tell at a glance if my team is winning?
    4. Discipline #4: Create a Cadence of Accountability – The fourth discipline is to create a frequently recurring cycle of accountability (WIG sessions) to review past performance and plan moves that advance the score. In a Scrum team, the Sprint Retrospective is a routine that strives to achieve this by expecting the team to discuss what went well and what went wrong, to identify improvement opportunities for future sprints. A WIG session has the following three-part agenda:
      1. Account: Report on commitments
      2. Review the scoreboard: Learn from successes and failures
      3. Plan: Clear the path and make new commitments

    Over the years, I have seen numerous strategic organizational initiatives being launched with the right intent and much fanfare. However, only a few of them achieved the real goals and many went down quietly over time, slowly suffocated by the whirlwind. The book summarizes this situation beautifully and kindles hope at the end: Once people give up on a goal that looks unachievable – no matter how strategic it might be – there is only one place to go: back to the whirlwind. After all, it’s what they know and it feels safe. When this happens, your team is now officially playing not to lose instead of playing to win and there is a big difference. Simply put, 4DX gets an organization playing to win!

    Work deeply

    We are taught from childhood that “no pain, no gain”, emphasizing the importance of spending time and effort to achieve results. Once we grow up, we encounter self-help books that teach us to pivot towards working “smart”, not just “hard”, and present their own theories and practices to make the point. All of them are based on the author’s personal experiences and beliefs, and we can benefit by learning from them and customizing for ourselves. One such book that struck a chord with me is “Deep Work” by Cal Newport. Just as I finished the chapter on Rule #1 – Work Deeply, I was able to connect my learnings to a challenge my son wanted help with, and immediately wrote this blog post with my thoughts.

    Let me start with references from the book, followed by my inferences that will explain the picture above.

    • Roy Baumeister on willpower: You have a finite amount of willpower that becomes depleted as you use it.
    • Ritualize: To make the most out of your deep work sessions, build rituals of the same level of strictness and idiosyncrasy. These rituals will minimize the friction in transition to depth, allowing us to go deep more easily and stay in that state longer.
    • 4DX: The 4 disciplines of execution (abbreviated 4DX) is based on the fundamental premise that execution is more difficult than strategizing. It helps address the gap between what needs to be done (strategy) and how to do it (execution):
      • Discipline #1: Focus on the Wildly Important – Ironically, the more you try to do by pushing too hard, the less you accomplish. So, execution should be aimed at a small number of “wildly important goals”.
      • Discipline #2: Act on the Lead Measures – There are two types of measures for success – lag measures and lead measures. Lag measures describe the thing you are ultimately trying to improve while lead measures focus on the new behaviors that will drive success on the lag measures. The problem with lag measures is that they come too late to change your behavior. So, start with acting on lead measures.
      • Discipline #3: Keep a Compelling Scoreboard – An always visible scoreboard creates a sense of competition that drives us to focus on these measures, even when other demands vie for our attention. It also provides a reinforcing source of motivation. Finally, it allows us to recalibrate expectations as required to achieve what is wildly important (like an agile burndown chart will help the team recalibrate efforts required to meet sprint goals).
      • Discipline #4: Create a Cadence of Accountability – A periodic review of scoreboard helps us to review progress towards lag measures and pivot as required in case the progress does not really converge towards producing the ultimate results expected.
    • Be Lazy: Tim Kreider, in his “The Busy Trap” blog, says – “I am not busy. I am the laziest ambitious person I know” and goes on to explain: “Idleness is not just a vacation, an indulgence or a vice; it is as indispensable to the brain as vitamin D is to the body, and deprived of it we suffer a mental affliction as disfiguring as rickets. The space and quiet that idleness provides is a necessary condition for standing back from life and seeing it whole, for making unexpected connections and waiting for the wild summer lightning strikes of inspiration — it is, paradoxically, necessary to getting any work done”. So, a periodic shutdown from work will enhance our ability to produce valuable output for the following reasons:
      • Reason #1: Downtime aids insight – Unconscious Thought Theory (UTT) posits that the unconscious mind is capable of performing tasks outside of one’s awareness, and that unconscious thought is better at solving complex tasks. The implication of this line of research is that giving your conscious brain time to rest enables your unconscious mind to take a shift sorting through your most complex professional challenges.
      • Reason #2: Downtime helps recharge the energy needed to work deeply – Attention Restoration Theory (ART) asserts that people can concentrate better after spending time in nature, or even looking at scenes of nature. The core mechanism of this theory is the idea that you can restore your ability to direct your attention by giving this activity a rest.
      • Reason #3: The work that regular downtime replaces is usually not that important – Working in the information technology field for US multinationals from India means one’s evening is core work time, spent attending meetings and discussions during the limited time-zone “overlap” available. So, I replaced “evening downtime” with “regular downtime”, which for me is typically a few hours every morning when I go for my run and spend time with family. The point is that your capacity for deep work in a given day is limited, and by setting aside some time to relax and reenergize, you are not missing out on much of importance.

    Putting it all together: Going back to my son’s challenge, he got an unexpected one-week holiday as his school had to reschedule classes due to the second COVID-19 wave. He was initially happy with the change, as he could spend the extra week as he wished (usually on video games), but the willpower he had originally corralled to excel academically went unutilized, and his initial euphoria eventually turned into guilt.

    As humans, we are usually tempted to engage in shallow activities that are easily enjoyable (like playing video games or watching our favorite TV show) instead of deep work (like studying or writing this blog). But we are also ambitious and aspirational, with a need for achievement to make our life purposeful. Willpower helps us handle this conflict by motivating us to engage in the deep work required for purposeful activities, which is usually difficult. But our willpower is finite; it depletes with use and needs to be recharged for sustenance. Overutilizing willpower leads to exhaustion, underutilizing it leads to guilt, and utilizing it smartly produces great results for our wildly important goals. Smartly utilizing our willpower means:

    • Using rituals to minimize the friction in transition to deep work and amplifying the benefits of finite willpower
    • Using the 4 disciplines of execution to measure and progress towards our vision and goals
    • Providing ourselves sufficient rest to recharge the energy needed to work deeply

    With years of trial and error, I was already practicing some of these disciplines and this book provides additional structure that should make me more effective!

    Books 2020

    2020 was an exceptional year mostly consumed by the pandemic that caused unprecedented disruption globally, a year that many will remember forever but would like to forget. Most of the books I read during the year were around start-ups and management. In this blog, I have briefly covered some of the books I enjoyed reading during the year.

    The Hard Thing About Hard Things – Ben Horowitz

    Ben has shared his personal experiences in this book with profound insights into how to run an organization, starting from hiring the right executives, taking care of people and products, dealing with uncertainty and a lot more. While a lot of his lessons might appear to be applicable only in start-ups, I strongly believe they are relevant for any organization. Any big company using the excuse of being a large enterprise to tolerate inefficiencies, mediocrity, politics and sticking to outdated ways of working is certain to disintegrate and disappear over time, unless senior leadership stems the rot before it is too late. We have numerous examples of great companies disappearing rather quickly as they failed to adapt to changing business landscape. Ben has derived much inspiration from Andy Grove’s High Output Management, which was my next read.

    High Output Management – Andy Grove

    This book from almost forty years back is quite relevant even today. That a manager is primarily accountable for the success or failure of an organization is captured by the single most important sentence of the book: The output of a manager is the output of the organizational units under his or her supervision or influence. Many managers ascribe success to their own abilities but attribute failure to others; this behaviour is irresponsible and hollow. Every manager in the information technology domain should ponder the three questions that Andy asks:

    1. Are you adding real value or merely passing information along? How do you add more value?
    2. Are you plugged into what’s happening around you? And that includes what’s happening inside your company as well as inside your industry as a whole. Or do you wait for a supervisor or others to interpret what is happening?
    3. Are you trying new ideas and new techniques / technologies – personally trying them, not just reading about them? Or are you waiting for others to figure out how they can re-engineer your workplace – and you out of that workplace?

    The Ride Of A Lifetime – Robert Iger

    This was one of the books recommended by Bill Gates and a Sunday Times Book of the Year 2019. Robert Iger became CEO of The Walt Disney Company in 2005, during a difficult time marked by digital disruption. He led Disney through the acquisitions of Pixar, Marvel, Lucasfilm and 21st Century Fox, and the launch of Disney+ with excellent original content (particularly special for a Star Wars and Mandalorian fan). He signed off as CEO after making Disney the largest and most respected media company in the world. He achieved this by focusing relentlessly on three clear strategic priorities from the day he became CEO:

    1. Devote most of the time and capital to the creation of high-quality branded content.
    2. Embrace technology to the fullest extent, first by using it to enable the creation of higher quality products, and then to reach more consumers in more modern, more relevant ways.
    3. Become a truly global company.

    The book revolves around the ten principles that he calls out as necessary for true leadership: Optimism, Courage, Focus, Decisiveness, Curiosity, Fairness, Thoughtfulness, Authenticity, Integrity and The Relentless Pursuit of Perfection.

    Land Of The Seven Rivers – Sanjeev Sanyal

    This book was recommended by one of my friends towards the end of 2020 and turned out to be a page-turner that I completed in less than two weeks. I have read several English books on western history and always yearned for a good one on India. This book fulfilled that quest! Starting with Rodinia and Pangea from millions of years back, the book beautifully traces the path to Sapta-Sindhu (the Land of the Seven Rivers). I have read about the River Saraswathi (usually quoted from the Rig Veda) in the past, but this book provides the best historical details and connects them to current geographical features. Next time I visit Delhi, Haryana and Rajasthan, I will make it a point to see the Ghaggar river. The events from the Mauryan empire through the Guptas, Mughals and British to present-day independent India are covered quickly but include all the significant events that shaped India into its current form. Indeed, a brief history of India’s geography!

    I always wondered why India, which was once a great civilization, did not keep pace with the West’s progress over the last thousand years. Sanjeev has an insightful explanation! There appears to have been a shift in India’s cultural and civilizational attitude towards innovation and risk-taking from the end of the twelfth century. There are many signs of the closing of the mind:

    • Sanskrit, once an evolving and dynamic language, stopped absorbing new words and usages and eventually fossilized. Sanskrit literature became obsessed with purity of form and became formulaic.
    • Similarly, scientific progress halted as the emphasis shifted from experimentation to learned discourse.
    • Al-Biruni, writing at the same time that Mahmud Ghazni was making his infamous raids, commented that contemporary Indian scholars were so full of themselves that they were unwilling to learn anything from the rest of the world. He then contrasts this attitude with that of their ancestors.

    Hackers: Heroes of the Computer Revolution – Steven Levy

    Steven explains how “Hacker Ethic” evolved over three decades starting from the closed community of early mainframe hackers on time share terminals at MIT, to the open community of self-made hardware hackers out of their garages at Silicon Valley, finally paving the way for game hackers. Refer to my blog for my notes from this book.

    Zero to One – Blake Masters & Peter Thiel

    My reading habit also took a hit during the first couple of months of the pandemic, but I restarted reading books in June with this start-up book, which I covered in an earlier blog.

    Hackers: Heroes of the Computer Revolution

    As a software engineer for more than 20 years, I have seen how computing has evolved since the beginning of the internet age. A couple of months back, I heard one of my seniors passionately speak about his computing experience from the 80s, which evoked my curiosity about the early computer revolution. My search for an authority on the topic ended with Steven Levy’s best-selling 1984 book about hacker culture – Hackers: Heroes of the Computer Revolution.

    The intriguing element of the book is “Hacker Ethic” – in Steven’s words, it was a philosophy of sharing, openness, decentralization and getting your hands on machines at any cost to improve the machines and to improve the world. He narrates computer evolution from mid 1950s till 1984 from a hacker perspective, covering people and machines that might not be well known to people from the internet age.

    For software engineers of the current millennium, who take gigahertz CPUs and terabytes of memory in 64-bit machines for granted, it is unfathomable that their predecessors from the hacker era created wonders with a tiny fraction of these resources on 8-bit machines. Assembly language programmers were a celebrated lot, and they innovated by hacking primitive microprocessor-based computers in machine language!

    Steven explains how “Hacker Ethic” evolved over three decades starting from the closed community of early mainframe hackers on time share terminals at MIT, to the open community of self-made hardware hackers out of their garages at Silicon Valley, finally paving the way for game hackers. This forms the three parts of the book that I have summarized in the picture below.

    I have watched several videos on personal computing revolution and particularly enjoyed Triumph of the Nerds. But reading a book is always a unique experience as it gives an opportunity for imagination. As I read this book, I felt as if I was sitting beside the hackers and watching them code. While computing has changed a lot over generations, one aspect has remained just the same – hackers always push computers to the limits and their hunger for more has driven the industry forward!

    Blockchain

    Blockchain is a distributed ledger that is used to record transactions of a digital asset such as bitcoin. It has been one of the most talked about technologies in the 2010s, continuing to be popular for its tamper-proof design even after bitcoin became notorious for its price vagaries.

    Blockchain has been a mainstream term since 2014, with applications being explored across several industries – to store data about property exchanges, stops in a supply chain, and even votes for a candidate. I have been following blockchain for many years and finally got some hands-on experience with Ethereum while completing a Pluralsight course.

    In this blog, I have listed key blockchain terms and concepts with brief descriptions, and also summarized the development environment I used:

    • Blockchain is a chain of digital data blocks, each containing:
      • Information about transactions, such as date, time, amount, price, etc.
      • A unique code called a “hash” that distinguishes one block from another, along with the hash of the previous block – this link is what chains the blocks together. Hashes are generated by a hashing process.
    • Hashing: an algorithm performed on data to produce an output that can be used to verify that the data has not been modified, tampered with or corrupted:
      • The length of the output is always the same.
      • One-way: the hash cannot be converted back into the original data.
      • A digital fingerprint that allows verifying consistency.
    • Obfuscation and Encryption provide data security to blockchain.
    • Key features of Blockchain: Immutable, decentralized, verifiable, increased capacity, better security, faster settlement.
    • Blockchain can be public (like Ethereum) or private (like R3 Corda).
    • Top implementations: Ethereum, Hyperledger Fabric, Ripple, Quorum, R3 Corda.
    • Distributed Applications (DAPPs) architecture using Ethereum: Blockchain encapsulates shared data and logic related to transactions while client is responsible for interface, user credentials and private data.
    • Payment for transactions is made through “gas”.
    • Transactions contain information on recipient, signature, value, gasprice, startgas and message.
    • A consensus mechanism is required to confirm transactions that take place on a blockchain without the need for a third party. Proof of Work and Proof of Stake are the two models currently available to achieve this.
    • Proof of Work is based on cryptography, hence digital coins like Bitcoin and Ether are called cryptocurrencies. Miners race to solve cryptographic puzzles – finding a hash that meets a difficulty target – that are so hard only powerful computers can solve them. No puzzle is ever the same, so once one is solved, the network knows the new block of transactions is authentic. Although Proof of Work is an amazing invention, it needs significant amounts of electricity and is also very limited in the number of transactions it can process at the same time.
    • With Proof of Stake, validators confirm block transactions based on the amount of the network’s currency they hold and stake. This way, instead of expending energy on Proof of Work puzzles, a validator is limited to validating a percentage of transactions reflective of their ownership stake.
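The hashing, chaining and Proof of Work ideas listed above can be demonstrated in a few lines. Here is a toy sketch in Python (the function names and the difficulty value are mine, and this is not how Ethereum or Bitcoin are actually implemented):

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's contents deterministically (sorted keys, stable encoding).
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(index, data, prev_hash, difficulty=2):
    """Build a block and search for a nonce so the block's hash starts with
    `difficulty` zeros -- a toy stand-in for Proof of Work."""
    block = {"index": index, "data": data, "prev_hash": prev_hash, "nonce": 0}
    while not block_hash(block).startswith("0" * difficulty):
        block["nonce"] += 1
    return block

def valid_chain(chain, difficulty=2):
    # Each block must link to the previous block's hash and meet the PoW target.
    for prev, cur in zip(chain, chain[1:]):
        if cur["prev_hash"] != block_hash(prev):
            return False
        if not block_hash(cur).startswith("0" * difficulty):
            return False
    return True

genesis = make_block(0, "genesis", "0" * 64)
b1 = make_block(1, {"from": "alice", "to": "bob", "value": 5}, block_hash(genesis))
chain = [genesis, b1]
print(valid_chain(chain))     # True
genesis["data"] = "tampered"  # modify history...
print(valid_chain(chain))     # False: b1 no longer links to genesis's hash
```

Because each block embeds the previous block’s hash, changing any historical block invalidates every block after it – this is the tamper-evidence that lets the ledger be trusted without a third party.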

    Development environment:

    Ethereum smart contracts are written in a programming language called Solidity, an object-oriented language with plugins available for popular IDEs. I used the following setup:

    • IDE – Eclipse with YAKINDU-Solidity Tools / Visual Studio Code
    • Node.js Package Manager – npm
    • Windows Package Manager – chocolatey (installed from Powershell as administrator)
    • Node.js Windows Build Tool
    • Local Ethereum test server and emulator – Ganache / TestRPC
    • Dev & Testing framework – Truffle Suite
    • Crypto wallet & gateway – Metamask
    • My practice code – https://github.com/gsanth/experiment/tree/master/ethereum_experiment

    To summarize, Blockchain’s potential as a decentralized form of record-keeping is enormous. From greater user privacy and heightened security to lower processing fees and fewer errors, blockchain technology will continue to see applications across several industries. However, it is easy to get carried away by blockchain’s apparent potential and get into technology overkill. So, it is important to understand the pros and cons of blockchain, and ensure we leverage it for the right use cases.

    | Pros | Cons |
    | --- | --- |
    | Decentralized (difficult to tamper) | Technology cost for mining |
    | Improved accuracy (no manual effort) | Low transaction throughput |
    | Reduced cost (no third-party charges) | Poor reputation (illicit activities) |
    | Transparent technology | |

    AWS Certified Solution Architect – Associate

    In today’s rapidly changing technology landscape, staying relevant as a software engineer hinges on one’s ability to continuously learn and master emerging technologies. To this end, structured technology courses and certifications help lay a solid foundation that can lead to eventual mastery with real life hands-on experience. I target one technology certification every year, finished Stanford Machine Learning Certification last year and decided to pursue a cloud certification in 2020.

    My goal was to complete a comprehensive cloud learning path, and I had to choose between Google Cloud Platform, Microsoft Azure and Amazon Web Services. All of them offer similar services, so learning one automatically builds understanding of the others. I decided to pursue AWS certification as it is the clear industry leader, with almost one third of the global cloud market share. There are a dozen AWS certifications available, and my choice was Solution Architect – Associate, the most popular one as it covers the entire AWS offering.

    Magic Quadrant for Cloud Infrastructure as a Service, Worldwide (2020)

    Given my existing familiarity and understanding of cloud computing, I had a bit of a head start and was able to successfully complete my certification in about a month. This blog summarizes my experience and learnings through this journey.

    AWS certifications require serious preparation, which usually starts with identifying a MOOC platform for access to learning material. I had used Coursera for the ML certification last year, as it came bundled with a Stanford certificate. As this certification is provided directly by AWS, I found Pluralsight more prolific and flexible for my needs.

    This was my first time using Pluralsight, and I was thoroughly impressed with the learning experience. I had typically used YouTube for quick reference on new technology topics, but the video lectures on Pluralsight take the learning experience to a completely different level. The one minor drawback is hands-on practice: while Coursera offered hands-on exercises, I had to independently create an AWS Free Tier account to practice through the AWS Management Console and CLI alongside the Pluralsight courses. Being a techie, this was perfectly fine with me, and I enjoyed that experience as well.

    After completing the first Pluralsight learning path and some hands-on practice, I attempted the sample questions on AWS site and felt my preparation was insufficient. So, I finished a few more relevant Pluralsight courses, digital training from AWS site, read through whitepapers and FAQs before attempting practice exams at Kaplan (offered free with Pluralsight).

    To summarize, I leveraged the following resources:

    AWS is vast, and a solution architect is expected to understand all its offerings. Developing a deep understanding across the foundational service domains of compute, networking, storage and databases, along with other key topics like security, analytics, app integration, containerization and cloud-native solutions, made it an awesome learning experience.

    AWS Services: Knowledge Areas for the Solutions Architect – Associate Certification are in bold

    After about three weeks of preparation, I scheduled my exam at https://www.aws.training/Certification. AWS certification exams used to require visiting an exam center in a controlled environment, but with the COVID situation, a proctored online exam option was also available. I chose a Pearson VUE exam and scheduled a Sunday morning slot. The caveats called out for the proctored online exam are a bit scary, particularly the possibility of losing the internet or power connection in the middle of the 130-minute exam (quite common in India during the monsoons). Fortunately, my internet connection remained stable despite heavy rain during the exam, and I passed. So, here I am – AWS Certified Solutions Architect – Associate!

    Zero To One

    Information and Communication Technology has been the primary driver of innovation and engineering advances during the last four decades – so dominant that the term "technology" today refers to these fields, even though several other engineering disciplines continue to exist! I am fortunate to have started my professional career in information technology and have enjoyed being part of it for more than twenty years. Unlike other technology domains that emerged as hotspots since the Industrial Revolution 200 years ago, information technology has an extremely low entry barrier, which allowed passionate technologists to launch their enterprises from garages. Combine this with the success of the venture capital industry from the 1970s, and start-ups have been the primary source of innovation in the technology industry since the advent of personal computing with the Intel 8080 processor.

    While I have not worked for start-ups so far, I strongly believe that start-up lessons can help enterprises improve their ability to succeed while creating new products. In this blog, I will share my notes from “Zero to One”, one of the best books on start-up philosophy written by Peter Thiel, a successful entrepreneur himself.

    “Zero to One” has an explanation for why most new technology trends during the last twenty years have come from start-ups. From the Founding Fathers in politics to the Royal Society in science to Fairchild Semiconductor’s “traitorous eight” in business, small groups of people bound together by a sense of mission have changed the world for the better. The easiest explanation for this is negative: it’s hard to develop new things in big organizations, and it’s even harder to do it by yourself. Bureaucratic hierarchies move slowly, and entrenched interests shy away from risk. In the most dysfunctional organizations, signaling that work is being done becomes a better strategy for career advancement than actually doing work. At the other extreme, a lone genius might create a classic work of art or literature, but could never create an entire industry. Start-ups operate on the principle that you need to work with other people to get stuff done, but you also need to stay small enough so that you actually can. Clayton Christensen provides a similar explanation in his book “The Innovator’s Dilemma” through the concept of “disruptive innovation”, showing how most companies miss out on new waves of innovation. Does this mean big companies cannot develop new things? They can, as long as they enable the teams building new things to operate like a start-up, without burdening them with bureaucracy and creativity-sapping processes.

    Peter Thiel suggests that we must abandon the following four dogmas, created after the dot-com crash, that still guide start-up business thinking today:

    1. Make incremental advances: Small increments using agile methods have far better chances of success today than the waterfall approach.
    2. Stay lean and flexible: Avoid the massive plan-and-execute model. Instead, iterative development helps teams stay nimble and deliver through meaningful experimentation.
    3. Improve on the competition: New things are invariably improvements on recognizable products already offered by successful competitors.
    4. Focus on product, not sales: Technology is primarily about product development, not distribution.

    Thiel argues against these dogmas, but having seen a number of projects in large enterprises, I would say it is better to stick to these principles by default and make exceptions only for compelling reasons.

    When it comes to creating new software products for a market, don’t build an undifferentiated commodity business but one that’s so good at what it does that no other product can offer a close substitute. Google is a good example of a company that went from zero to one: it distanced itself from Microsoft and Yahoo almost 20 years ago and became a monopoly. While monopolies sound draconian, the companies that get to the top build their monopoly on a unique value proposition they offer in their markets. So, don’t build new things unless there is a desire and a plan to capture significant market share, if not a monopoly. Every monopoly is unique, but they usually share some combination of the following characteristics:

    • Proprietary technology
    • Network effects
    • Economies of scale
    • Branding

    Another interesting observation is around secrets: most people act as if there were no secrets left to find. With advances in maths, science and technology, we know a lot more about the universe than previous generations, but there are still numerous unknowns yet to be conquered. It helps to be conscious of the four social trends that have conspired to root out belief in secrets:

    1. Incrementalism: From an early age, we are taught that the right way to do things is to proceed one very small step at a time, day by day, grade by grade. However, unlocking secrets requires us to be brutally focused on the ultimate goal rather than staying satisfied with interim milestones.
    2. Risk Aversion: People are scared of secrets because they are scared of being wrong. If your goal is to never make a mistake in life, you shouldn’t look for secrets. And remember, you can’t create something new and impactful without making mistakes.
    3. Complacency: Getting into a top institute or corporation is viewed as an achievement in itself, with nothing more to worry about – you are set for life. This leads to complacency, and the fire to unlock secrets goes out.
    4. Flatness: As globalization advances, people perceive the world as one homogeneous, highly competitive marketplace and assume that someone else would already have found any remaining secrets.

    To summarize, when a start-up or an enterprise decides to create a new product, it should resist the temptation to build a commodity one. It should be a product with clear differentiation that helps create a monopoly, or significant market share at a minimum. This can happen only through hard work and the dedication to unlock some secrets.

    There are many more learnings in the book, but I have only mentioned the key ones that can help us introspect and stay focused on our goals while creating new products.

    AI / ML in enterprises: Technology Platform

    As an organization embarks on leveraging AI / ML at enterprise scale, it is important to establish a flexible technology platform that caters well to the different needs of data scientists and the engineering teams supporting them. The technology platform here includes the hardware architecture and software frameworks that allow ML algorithms to run at scale.

    Before getting into the software stack directly used by data scientists, let’s understand the hardware and software components required to enable machine learning.

    • Hardware layer: x86-based servers (typically Intel) with acceleration using GPUs (typically NVIDIA)
    • Operating systems: Linux (typically Red Hat)
    • Enterprise Data Lake (EDL): a Hadoop-based repository like Cloudera or MapR, along with supporting stacks for data processing:
      • Batch ingestion & processing: example – Apache Spark
      • Stream ingestion & processing: example – Apache Spark Streaming
      • Serving: example – Apache Drill
      • Search & browsing: example – Splunk

    Once the necessary hardware and data platforms are set up, the focus is on providing an effective end-user computing experience to data scientists:

    • Notebook framework for data manipulation and visualization: e.g. Jupyter Notebooks or Apache Zeppelin, which support the most commonly used programming languages for ML, like Python and R.
    • Data collection & visualization: e.g. Elastic Stack and Tableau.
    • An integrated application and data-optimized platform like IBM Spectrum simplifies things for enterprises by addressing all the needs listed above (components include an enterprise grid orchestrator along with a notebook framework and Elastic Stack).
    • Machine learning platforms: specialized platforms like DataRobot, H2O, etc. simplify the ML development lifecycle and let data scientists and engineers focus on creating business value.
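As an illustration of the day-to-day work these notebook frameworks enable, here is the kind of exploratory cell a data scientist might run in Jupyter or Zeppelin. The dataset and column semantics are made up, and Python's standard library stands in for the usual pandas/visualization stack.

```python
# Exploratory-analysis sketch: summarize a small toy dataset,
# as one might do in a single notebook cell before deeper modeling.
import statistics

# Toy dataset: daily transaction counts for one week (illustrative values)
daily_counts = [120, 135, 128, 150, 142, 90, 85]

summary = {
    "mean": statistics.mean(daily_counts),
    "median": statistics.median(daily_counts),
    "stdev": round(statistics.stdev(daily_counts), 2),
}
print(summary)  # mean ≈ 121.43, median = 128
```

In a real notebook this cell would typically be followed by a plot (e.g. via matplotlib or the Elastic Stack dashboards mentioned above) to spot the weekend dip visually.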

    There are numerous other popular platforms like TensorFlow, Anaconda and RStudio, and evergreen ones like IBM SPSS and MATLAB. Given the number of options available, particularly open-source ones, an attempt to create a comprehensive list would be difficult. My objective is to capture the high-level components required as part of a technology platform for an enterprise to get started with AI / ML development.
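To ground what all these platform layers ultimately support, here is a minimal sketch of the model-fitting step itself: a one-variable least-squares fit in pure Python. Real workloads would of course use TensorFlow, H2O or similar; the data here is synthetic and the function name is illustrative.

```python
# Minimal model-fitting sketch: closed-form least squares for y = a*x + b.
# This is the kind of computation ML platforms scale out across the
# hardware and data layers described above.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Synthetic data, roughly y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # → 1.97 0.15
```

The point of the sketch is only to show the shape of the workload: everything in the platform stack exists so that fits like this can run on data far too large for a single machine.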