5 Lessons to Learn from the Fastly and Akamai Outages – ReadWrite

Published 07/27/2021

Delivering a website: the Domain Name System

Fastly, one of the Internet’s many critical components, experienced an outage on June 8th that took some of the most prominent websites online... well, offline. Then, almost two weeks later, Akamai, one of the largest global content delivery networks, also stumbled, taking out online systems for airlines, banks, and stock exchanges worldwide.

Lessons to Learn from the Fastly and Akamai Outages

In light of these recent outages, it’s important to remember that failures will happen, but the vital attribute of failure is the opportunity to learn. So what can you learn from these examples? Here are five lessons learned from the recent outages and actions you can take to ensure you have a fast website that provides a reliable digital experience even when infrastructure partners fail.

Lesson 1: Everything fails eventually. Have a backup plan.

I was told as a young engineer: “If you design software with the expectation that every dependency will fail at some point, you’ll never be disappointed, regardless of the outcome.”

The recent failures certainly bring this lesson home, but it is easy to forget how routine failure is. The scope and magnitude of the Fastly and Akamai outages drew headlines, yet the internet experiences smaller failures constantly. For example, for the week of June 14 – 20, 2021, there were 427 network outages, 352 internet service provider outages, and 23 outages in the public cloud.

It’s more important than ever to recognize the inescapable fact of the internet: everything fails eventually. To limit the impact, reliability engineers implement redundancy wherever financially and operationally possible.

An Online Infrastructure

When users access an online store, a delicate handoff occurs between multiple providers of core infrastructure. So first, let’s take a quick look at the significant infrastructure levels for accessing an online store and identify the opportunities and costs of protecting each level against failures outside your control. The first level is the Domain Name System (DNS).

The Domain Name System (DNS)

DNS is responsible for translating a website’s name (e.g., your store’s domain name) to its underlying Internet Protocol (IP) address. Think of DNS as a global phone book for the internet.

DNS is distributed: 13 named root servers form the backbone of the database, with many copies replicated globally across geographic locations. In addition, every website has a DNS provider responsible for mapping the site’s name to its current IP address.

As these values may change over time, DNS providers attach a Time to Live (TTL) to each record. The TTL tells resolvers how long they may cache the record before asking again, so changes propagate within a bounded time. If a DNS record does not exist for a website, the end-user is left facing an error page stating, “Server Not Found/Webpage Not Available.” (There are many other reasons to get an error page, too.)
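To see why the TTL matters during an outage, here is a minimal sketch of a cache that honors it. The names (`DnsRecord`, `TtlCache`) and the injected clock are assumptions made for illustration; real resolvers are considerably more involved.

```typescript
// Hypothetical record shape: an address plus how long (in seconds) it may be cached.
interface DnsRecord {
  address: string;
  ttlSeconds: number;
}

interface CacheEntry {
  record: DnsRecord;
  expiresAt: number; // milliseconds on the injected clock
}

// A toy cache that keeps a record only until its TTL runs out.
class TtlCache {
  private entries = new Map<string, CacheEntry>();
  private now: () => number;

  // `now` returns the current time in milliseconds; injectable so it can be faked.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  put(name: string, record: DnsRecord): void {
    const expiresAt = this.now() + record.ttlSeconds * 1000;
    this.entries.set(name, { record, expiresAt });
  }

  // Returns the cached record, or undefined once it has expired
  // (at which point a resolver would have to ask the DNS provider again).
  get(name: string): DnsRecord | undefined {
    const entry = this.entries.get(name);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(name);
      return undefined;
    }
    return entry.record;
  }
}
```

The trade-off is visible in the sketch: a short TTL means a changed record is picked up sooner, at the cost of more lookups.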

For many companies, DNS functionality is provided by a single company — leaving companies exposed to DNS lookup failures if that provider has a material outage. Implementing redundancy for DNS does not cost much, as DNS services are a few hundred dollars annually at most — but there is a high operational cost, as multiple providers must be updated simultaneously about any backend changes.

As is common in the modern age, a bevy of companies provide offerings to automate this process, and it’s often money well spent.

The Edge Network / Content Distribution Network

Delivering a website: the Edge Network

The next level is the “edge” network, often a Content Distribution Network (CDN).

It is at this layer that our high-profile failures (Fastly and Akamai) took place. CDNs help websites load faster by reducing the physical distance between your content and the user. By serving cached copies from a global fleet of servers, CDNs enable users worldwide to view the same high-quality content without slow loading times. Unlike DNS redundancy, where the financial cost is minimal, implementing multiple redundant CDNs is expensive.

For this reason, CDN redundancy is often employed only by enterprise customers and large eCommerce brands. In most cases, the CDN redundancy is implemented at the DNS layer itself. However, for smaller companies, ensuring you have up-to-date IP addresses for your origin server (your web host itself) can save the day if you have DNS separated from your CDN provider.

CDN Failure Mitigation

When CDNs fail, engineers can update DNS records to have users bypass the CDN altogether. This gives customers a functional (but slower) experience. During the Fastly outage, many companies sidestepped the impact by simply redirecting users to either their web server or backup CDN provider.

Mitigating CDN failure via DNS bypass
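To make the bypass concrete, here is a minimal sketch of the decision an engineer (or an automation script) makes before touching DNS. The names (`TargetHealth`, `chooseDnsTarget`) and the idea of a priority-ordered list are hypothetical. The actual record update depends on your DNS provider, so it is left out.

```typescript
// Hypothetical health snapshot for each place traffic could be pointed at.
interface TargetHealth {
  name: string;     // e.g. "primary-cdn", "backup-cdn", "origin"
  hostname: string; // what the DNS record would point to
  healthy: boolean; // result of your own health check
}

// Pick the first healthy target in priority order.
// Returns undefined when nothing is healthy, so the caller can alert a human.
function chooseDnsTarget(targetsInPriorityOrder: TargetHealth[]): TargetHealth | undefined {
  for (const target of targetsInPriorityOrder) {
    if (target.healthy) return target;
  }
  return undefined;
}
```

Listing the origin last mirrors the article's advice: users get a functional (but slower) experience only when no CDN is available.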

The Origin Web Server

This is either a hosted content management system (Magento, WordPress, etc.) or a platform (Shopify, Kinsta, etc.). This is the traditional place where IT deploys redundancy resources, with backups and load balancing often already in place.

Delivering a website: Your Web Server

An important lesson from the Fastly outage: ensure your web servers have the capacity to serve all customers if necessary. If you are forced to bypass the CDN, your web servers will be responsible for serving all the traffic. CDNs often serve between 60% and 95% of all web requests from cache, so if you need to bypass this provider due to an outage, can your web server keep up with 10x the site traffic?
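Where does a figure like 10x come from? A rough back-of-the-envelope sketch, assuming the only thing that changes during a bypass is that cached requests now reach the origin (the function name is made up for illustration):

```typescript
// cacheHitRatio: fraction of requests the CDN answers from cache, in [0, 1).
function originLoadMultiplier(cacheHitRatio: number): number {
  if (cacheHitRatio < 0 || cacheHitRatio >= 1) {
    throw new RangeError("cacheHitRatio must be in [0, 1)");
  }
  // With the CDN, the origin sees (1 - ratio) of the traffic; without it, all of it.
  return 1 / (1 - cacheHitRatio);
}

for (const ratio of [0.6, 0.9, 0.95]) {
  const percent = Math.round(ratio * 100);
  console.log(`${percent}% cached -> ${originLoadMultiplier(ratio).toFixed(1)}x origin load`);
}
```

Under that assumption, a 60% hit ratio means roughly 2.5x the usual origin traffic, 90% means roughly 10x, and 95% means roughly 20x.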

Lesson 2: Understand your third-party dependencies.

When critical infrastructure fails and takes a site offline — you have a fatal failure. For example, if you have a first-party dependency on a provider, and that provider fails—your support team is hit with a barrage of angry customers (assuming they can reach you).

But what about third-party dependencies, those critical services woven into every online store? The blast radius of the Fastly outage extended beyond the sites that went offline, as it also took out hundreds of SaaS companies. Around the globe, marketing and development teams had blackouts in analytics data, failures in email follow-up campaigns, and more esoteric impacts, such as failures to calculate shipping costs for some regions.

Third-party dependencies also show up in the user experience, such as external JavaScript (jQuery, D3.js) failing to load, so the page does not render correctly. These non-fatal failures often cause the biggest headache, as users think the site is operational, but some components (e.g., clicking buttons) don’t work.

Analyzing your infrastructure dependencies

This free online tool provides a way to analyze the infrastructure needs of any website. Using Tesla.com as an example, we can see that there are dependencies on Akamai, Google, and Microsoft. Each one of these providers plays a critical role in the Tesla experience.

Analyzing infrastructure dependencies

For larger sites, especially those that employ regional content delivery, the web of dependencies can be extensive (and vary by the user’s location!). For example, looking at BBC.com, we can see dependencies on three CDN providers, three cloud providers, and a direct hosted advertising network. That’s a significant amount of infrastructure to oversee.

Looking at the cross infrastructure needs for the BBC site

Fatal Failures vs. Non-Fatal Failures

The solution reliability engineers employ here is to make as many failures fatal as possible. At first, this might seem counterintuitive, but a partially working site may indeed be more damaging than one that is hard down. Furthermore, fatal failures are the easiest to debug as the failure is explicit—the system itself stops.

Non-fatal failures, on the other hand, are often “Heisenbugs,” i.e., notoriously tricky issues that may be transient and never truly root-caused because overall, the system continues operating.

In light of this, reliability engineers push toward making failures explicit and minimizing the blast radius of dependencies by self-hosting as many services as possible.

For example, when an essential piece of JavaScript is needed to load some functionality, hosting this “on origin” (your webserver) is faster and more transparent. In addition, in the ever-increasing push towards privacy, hosting assets (fonts, JavaScript, images, etc.) “on origin” minimizes data sharing with external providers.

The key takeaway: When possible, streamline and host your dependencies on your infrastructure — and may all your failures be fatal.

Lesson 3: If you don’t measure, you won’t know.

For explicit fatal failures, the monitoring challenge is simple: is the website up? But what if just some parts of the website are broken, or worse, so slow they seem broken? Would you even know? Modern websites are surprisingly complex. The average website needs 73 network requests to load. That’s 73 different network calls out to dozens of separate networks. When outages happen, they might only impact one of those requests, but that one may be critical (think: credit card validation).

To make matters worse, site speed is not deterministic. For example, sites that load personalized content or ads may experience vastly different performance characteristics from user to user, region to region, or device to device. Complex systems require robust monitoring, and there has never been a better time to implement it.

If it’s not Real User Monitoring, it’s, by definition, Fake User Monitoring.

The only way to know how your site performs for your users is to measure its performance when loading and interacting with the site. This type of measurement is commonly referred to as Real User Monitoring, and in the world of eCommerce, this is the only monitoring worth looking at.

When explicit failures happen, like the failure of a third-party component to load, Real User Monitoring systems provide detailed views of what content failed, on what device, and which network or infrastructure partner it came from.

For implicit failures, where a third party may be in a degraded state and thus serving content more slowly, Real User Monitoring is the canary in the coal mine that provides accurate and actionable data to reliability engineers on what’s going on.

Server response delays, as seen by looking at the distribution of real user experiences

Infrastructure delays as seen through real user monitoring
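As a rough illustration of how such a distribution can expose a degraded partner, the sketch below groups real-user samples by provider and reports the median and 95th percentile. The sample shape (`RumSample`) and function names are assumptions, not the format of any particular monitoring product.

```typescript
// Hypothetical RUM sample: one response as measured in a real user's browser.
interface RumSample {
  provider: string;   // infrastructure partner that served the response
  responseMs: number; // response time as seen by the user
}

// Nearest-rank percentile (p in (0, 100]) of a non-empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list is undefined");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples by provider and report median (p50) and tail (p95) latency.
function summarizeByProvider(samples: RumSample[]): Record<string, { p50: number; p95: number }> {
  const grouped: Record<string, number[]> = {};
  for (const sample of samples) {
    if (!grouped[sample.provider]) grouped[sample.provider] = [];
    grouped[sample.provider].push(sample.responseMs);
  }
  const summary: Record<string, { p50: number; p95: number }> = {};
  for (const provider of Object.keys(grouped)) {
    summary[provider] = {
      p50: percentile(grouped[provider], 50),
      p95: percentile(grouped[provider], 95),
    };
  }
  return summary;
}
```

A healthy median paired with a p95 that suddenly balloons for one provider is the kind of implicit failure an "is the website up?" check would miss.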

In an eCommerce world, where site speed is critical to business success, Real User Monitoring provides the flight data recorder that engineers and business leaders need to optimize the store. This is even more important today, when even a tenth of a second slowdown in page load time can result in an 8.4% drop in conversion rate and a 9.2% decrease in average order value.

Lesson 4: Error messages matter.

When sites have a fatal failure, the “site unavailable” page rears its head. However, if the failure is more pronounced or further up the delivery chain (as was the case with Fastly), the error page might be something even less user-friendly.

Error page during Fastly outage: screengrab of The Guardian website during the Fastly outage. Credit: Twitter/@matthewchampion

Error pages are often overlooked as a potential source of customer outreach. Yet, a well-crafted error page can turn a frustrated, lost customer into a future potential sales opportunity.

Making great error pages

Error pages offer an opportunity to convey important information to your customer. Great error pages have three key attributes:

Acknowledge: This is your error, not the customer’s. It’s critical to acknowledge this and offer links to support services, social updates, and status pages. The key is to ensure the customer knows this is a temporary failure and that you will be back online soon.
Apologize: You have wasted the customer’s time. Nobody arrives at your website for free; you’ve paid for the visit with either advertising dollars or marketing effort. Now that they are at the site, you have failed to deliver the value you offered. Take a moment to convey that their visit matters to you.
Award: Just because your site is offline doesn’t mean the relationship has to end here. Offer the customer a discount if they provide an email address. Error pages can also redirect to third-party websites that (hopefully) are not offline. Use this moment to regain customer trust and move the sale forward.

When done correctly, error pages can be superheroes—and give your support teams (who are already dealing with other issues) some cover. When eCommerce sites are down, we also likely have lost our tracking and metrics capabilities, so capturing that user email and following up with additional offers might save the day.

Lesson 5: Client-side technology offers advanced protection.

When it comes to failure prevention, we often look at things we can implement at the infrastructure or server level. But what if the failure is the user’s network (cable’s out)? What if your site is so big that you can’t track every infrastructure partner? What if your IT team is months behind on implementing your last infrastructure request? Is it time to give up? Nope, it’s time to look at client-side solutions.

Client-side performance solutions run inside the user’s browser. These are pieces of code that you ship with your website but that run directly in the browser, like a guardian angel watching over the page load. Over the past decade, the web has gained some powerful yet often overlooked client-side capabilities, none more potent than Service Workers.

At your Service (Workers)

The Service Worker API was initially designed to facilitate offline browsing of a website, specifically Gmail. When Gmail first came out, people used it on early smartphones. When they went into the subway (where network connectivity is zero)— they couldn’t use the site. Obviously, an email client that couldn’t work offline was a buzzkill.

To fix this, the Google team developed a feature in the browser that would allow the browser itself to have some control over a website, even if the network was down.

They called the new feature Service Workers, as they are a form of code that can run in the browser (do work) but aren’t dependent on external network services. Service Workers act as a proxy between the website, the browser, and the network, and give developers the ability to store data on the device and respond to requests from inside the browser. In many ways, they are infrastructure-level ideas that run directly on the user’s device.

How Service Workers work — The Service Worker flow: adding client-side resiliency and performance

Service Workers can intercept network requests the browser sends and take action based on whether the network is available and whether endpoints are responding quickly, or return a locally cached copy of the site in the event of a server error. In advanced cases, they can enable client-side caching that makes the site both more reliable and faster.
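The interception logic can be sketched as a "network first, cache as fallback" function. To keep the example runnable outside a browser, it uses small stand-in interfaces (`ResponseLike`, `CacheLike`, `FetchFn`) that are assumptions of this sketch; in a real Service Worker those roles are played by the browser's `Response`, Cache API, and `fetch`.

```typescript
// Stand-ins for the browser's Response, Cache API, and fetch().
interface ResponseLike {
  ok: boolean; // false for server errors
  body: string;
}
interface CacheLike {
  get(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}
type FetchFn = (url: string) => Promise<ResponseLike>;

// Try the network; fall back to the local copy on a network or server error.
async function networkFirst(url: string, fetchFn: FetchFn, cache: CacheLike): Promise<ResponseLike> {
  try {
    const response = await fetchFn(url);
    if (response.ok) {
      await cache.put(url, response); // refresh the local copy
      return response;
    }
    // Server error: prefer a cached copy if one exists.
    const cached = await cache.get(url);
    return cached ? cached : response;
  } catch (networkError) {
    // Network unavailable: serve the cached copy, or surface the error.
    const cached = await cache.get(url);
    if (cached) return cached;
    throw networkError;
  }
}
```

In a real Service Worker, logic like this would typically live inside a `fetch` event listener and be handed to `event.respondWith(...)`.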

How Service Workers can help in times of peril:

Caching client-side:

The number-one benefit is the ability to store data, including those pesky third-party resources, on the device itself. A dedicated client-side cache will dramatically speed up a working website and provide some level of protection when individual assets are failing. If the cache is advanced enough, you may be able to reduce your returning customers’ dependencies on your network-based infrastructure by 70% or more.

Client side caching can help pages load 65% faster — Client-side caching can both protect from infrastructure failures AND load the website faster.

Client-side failover:

As we discussed, implementing multi-CDN solutions can be costly and requires a certain level of operational expertise. However, with a Service Worker, you can implement client-side failover that is both automatic and operationally simple.

For example, a client-side failover rule might say, “if foo.com is unavailable, or not responding in one second or less, then automatically try backup.foo.com.” All the benefits of advanced infrastructure level failover, with minimal effort.
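A rule like that can be sketched in a few lines. The helper names (`withTimeout`, `fetchWithFailover`) are invented for this example, and the request function is passed in as a parameter so the sketch runs anywhere; inside a Service Worker it would wrap the browser's `fetch`.

```typescript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Try the primary host; on error or timeout, try the backup host once.
async function fetchWithFailover<T>(
  path: string,
  fetchFrom: (url: string) => Promise<T>,
  primaryHost: string = "foo.com",
  backupHost: string = "backup.foo.com",
  timeoutMs: number = 1000,
): Promise<T> {
  try {
    return await withTimeout(fetchFrom(`https://${primaryHost}${path}`), timeoutMs);
  } catch (primaryFailure) {
    // Primary was unavailable or too slow; the backup gets no timeout in this sketch.
    return fetchFrom(`https://${backupHost}${path}`);
  }
}
```

Note that the sketch only stops waiting for the slow primary request; it does not cancel it. In a browser, an `AbortController` could be used for that.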

Client-side data buffering:

Most solutions that implement advanced client-side functionality include client-side performance metrics (Real User Monitoring). We can also continue capturing marketing data such as Google Analytics events and store them client-side for later transmission when the site is back online. No more data loss!
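A minimal sketch of the buffering idea, assuming a hypothetical `EventBuffer` class and a `send` function you supply. It keeps events in memory only; a production version would need to persist them (for example, in IndexedDB) so they survive a page reload.

```typescript
// Hypothetical analytics event and delivery function.
interface AnalyticsEvent {
  name: string;
  timestamp: number;
}
type Sender = (event: AnalyticsEvent) => Promise<void>;

class EventBuffer {
  private pending: AnalyticsEvent[] = [];
  private flushing = false;
  private send: Sender;

  constructor(send: Sender) {
    this.send = send;
  }

  // Queue the event, then try to deliver everything that is waiting.
  async track(event: AnalyticsEvent): Promise<void> {
    this.pending.push(event);
    await this.flush();
  }

  // Deliver buffered events in order, stopping at the first failure.
  // Returns how many events were delivered by this call.
  async flush(): Promise<number> {
    if (this.flushing) return 0; // another flush is already draining the queue
    this.flushing = true;
    let sent = 0;
    try {
      while (this.pending.length > 0) {
        try {
          await this.send(this.pending[0]);
        } catch (sendFailure) {
          break; // still offline; keep the event for the next flush
        }
        this.pending.shift();
        sent++;
      }
    } finally {
      this.flushing = false;
    }
    return sent;
  }

  pendingCount(): number {
    return this.pending.length;
  }
}
```

Calling `flush()` again when connectivity returns (for instance, from a retry timer) delivers the backlog in its original order.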

Advanced offline error messages:

What’s better than an error page? How about a full offline error site? Service Workers were designed to allow websites to work, at least partially, offline. A customer may not be able to complete checkout, but you can still serve a few top product pages and a client-side version of an Acknowledge, Apologize, Award (AAA) error response. Combined with client-side caches and data buffering, you might be able to allow a customer to “start to shop” while the website comes back online in the background.

Learning from failure

As they say, “to err is human, but to error is software.” As the world trends toward online-first marketplaces, it’s even more vital to learn from failure. The techniques and best practices outlined here give a glimpse into how we can all build robust, performant, and user-centric digital experiences, even in failure.

Image Credit: blue bird; pexels; thank you!

Jake Loveless

CEO

Jake Loveless has a 20-year career in making critical technology go faster. Jake is the CEO of Edgemesh, the global web acceleration company. Edgemesh enhances any browser, on any device, with intelligent, automated, next-generation client-side caching.


Politics

Fintech Kennek raises $12.5M seed round to digitize lending

Published

6 months ago

10/11/2023

Drew Simpson

London-based fintech startup Kennek has raised $12.5 million in seed funding to expand its lending operating system.

According to an Oct. 10 tech.eu report, the round was led by HV Capital and included participation from Dutch Founders Fund, AlbionVC, FFVC, Plug & Play Ventures, and Syndicate One. Kennek offers software-as-a-service tools to help non-bank lenders streamline their operations using open banking, open finance, and payments.

The platform aims to automate time-consuming manual tasks and consolidate fragmented data to simplify lending. Xavier De Pauw, founder of Kennek said:

“Until kennek, lenders had to devote countless hours to menial operational tasks and deal with jumbled and hard-coded data – which makes every other part of lending a headache. As former lenders ourselves, we lived and breathed these frustrations, and built kennek to make them a thing of the past.”

The company said the latest funding round was oversubscribed and closed quickly despite the challenging fundraising environment. The new capital will be used to expand Kennek’s engineering team and strengthen its market position in the UK while exploring expansion into other European markets. Barbod Namini, Partner at lead investor HV Capital, commented on the investment:

“Kennek has developed an ambitious and genuinely unique proposition which we think can be the foundation of the entire alternative lending space. […] It is a complicated market and a solution that brings together all information and stakeholders onto a single platform is highly compelling for both lenders & the ecosystem as a whole.”

The fintech lending space has grown rapidly in recent years, but many lenders still rely on legacy systems and manual processes that limit efficiency and scalability. Kennek aims to leverage open banking and data integration to provide lenders with a more streamlined, automated lending experience.

The seed funding will allow the London-based startup to continue developing its platform and expanding its team to meet demand from non-bank lenders looking to digitize operations. Kennek’s focus on the UK and Europe also comes amid rising adoption of open banking and open finance in the regions.

Featured Image Credit: Photo from Kennek.io; Thank you!

Radek Zielinski

Radek Zielinski is an experienced technology and financial journalist with a passion for cybersecurity and futurology.

Politics

Fortune 500’s race for generative AI breakthroughs

Published

6 months ago

10/11/2023

Drew Simpson

As excitement around generative AI grows, Fortune 500 companies, including Goldman Sachs, are carefully examining the possible applications of this technology. A recent survey of U.S. executives indicated that 60% believe generative AI will substantially impact their businesses in the long term. However, they anticipate a one to two-year timeframe before implementing their initial solutions. This optimism stems from the potential of generative AI to revolutionize various aspects of businesses, from enhancing customer experiences to optimizing internal processes. In the short term, companies will likely focus on pilot projects and experimentation, gradually integrating generative AI into their operations as they witness its positive influence on efficiency and profitability.

Goldman Sachs’ Cautious Approach to Implementing Generative AI

In a recent interview, Goldman Sachs CIO Marco Argenti revealed that the firm has not yet implemented any generative AI use cases. Instead, the company focuses on experimentation and setting high standards before adopting the technology. Argenti recognized the desire for outcomes in areas like developer and operational efficiency but emphasized ensuring precision before putting experimental AI use cases into production.

According to Argenti, striking the right balance between driving innovation and maintaining accuracy is crucial for successfully integrating generative AI within the firm. Goldman Sachs intends to continue exploring this emerging technology’s potential benefits and applications while diligently assessing risks to ensure it meets the company’s stringent quality standards.

One possible application for Goldman Sachs is in software development, where the company has observed a 20-40% productivity increase during its trials. The goal is for 1,000 developers to utilize generative AI tools by year’s end. However, Argenti emphasized that a well-defined expectation of return on investment is necessary before fully integrating generative AI into production.

To achieve this, the company plans to implement a systematic and strategic approach to adopting generative AI, ensuring that it complements and enhances the skills of its developers. Additionally, Goldman Sachs intends to evaluate the long-term impact of generative AI on their software development processes and the overall quality of the applications being developed.

Goldman Sachs’ approach to AI implementation goes beyond merely executing models. The firm has created a platform encompassing technical, legal, and compliance assessments to filter out improper content and keep track of all interactions. This comprehensive system ensures seamless integration of artificial intelligence in operations while adhering to regulatory standards and maintaining client confidentiality. Moreover, the platform continuously improves and adapts its algorithms, allowing Goldman Sachs to stay at the forefront of technology and offer its clients the most efficient and secure services.

Featured Image Credit: Photo by Google DeepMind; Pexels; Thank you!

Deanna Ritchie

Managing Editor at ReadWrite

Deanna is the Managing Editor at ReadWrite. Previously she worked as the Editor in Chief for Startup Grind and has over 20 years of experience in content management and content development.

Politics

UK seizes web3 opportunity simplifying crypto regulations

Published

6 months ago

10/10/2023

Drew Simpson

As Web3 companies increasingly consider leaving the United States due to regulatory ambiguity, the United Kingdom must simplify its cryptocurrency regulations to attract these businesses. The conservative think tank Policy Exchange recently released a report detailing ten suggestions for improving Web3 regulation in the country. Among the recommendations are reducing liability for token holders in decentralized autonomous organizations (DAOs) and encouraging the Financial Conduct Authority (FCA) to adopt alternative Know Your Customer (KYC) methodologies, such as digital identities and blockchain analytics tools. These suggestions aim to position the UK as a hub for Web3 innovation and attract blockchain-based businesses looking for a more conducive regulatory environment.

Streamlining Cryptocurrency Regulations for Innovation

To make it easier for emerging Web3 companies to navigate existing legal frameworks and contribute to the UK’s digital economy growth, the government must streamline cryptocurrency regulations and adopt forward-looking approaches. By making the regulatory landscape clear and straightforward, the UK can create an environment that fosters innovation, growth, and competitiveness in the global fintech industry.

The Policy Exchange report also recommends not weakening self-hosted wallets or treating proof-of-stake (PoS) services as financial services. This approach aims to protect the fundamental principles of decentralization and user autonomy while strongly emphasizing security and regulatory compliance. By doing so, the UK can nurture an environment that encourages innovation and the continued growth of blockchain technology.

Despite recent strict measures by UK authorities, such as His Majesty’s Treasury and the FCA, toward the digital assets sector, the proposed changes in the Policy Exchange report strive to make the UK a more attractive location for Web3 enterprises. By adopting these suggestions, the UK can demonstrate its commitment to fostering innovation in the rapidly evolving blockchain and cryptocurrency industries while ensuring a robust and transparent regulatory environment.

The ongoing uncertainty surrounding cryptocurrency regulations in various countries has prompted Web3 companies to explore alternative jurisdictions with more precise legal frameworks. As the United States grapples with regulatory ambiguity, the United Kingdom can position itself as a hub for Web3 innovation by simplifying and streamlining its cryptocurrency regulations.

Featured Image Credit: Photo by Jonathan Borba; Pexels; Thank you!

Deanna Ritchie

Managing Editor at ReadWrite

Deanna is the Managing Editor at ReadWrite. Previously she worked as the Editor in Chief for Startup Grind and has over 20 years of experience in content management and content development.
