SaaS Inception: The gotchas of dogfooding

With SaaS inception, we dogfood our product and benefit from using it.

At Octane, we make it easy for companies to implement consumption-based billing and pricing to charge their users. Naturally, it makes sense for us to charge our own users with our own product.

Welcome to SaaS inception.

With SaaS inception, we dogfood our product and benefit from using it (dogfood means trying and testing your own product). We get two primary benefits from doing this:

  • Using our own billing/pricing system: Given the nature of our product, it would be a wasted opportunity not to use it to bill our own customers. We also get deeper insight into how our customers use the product, which helps us make informed decisions about our pricing strategy.
  • Testing: Testing is critical at a company that handles billing and pricing. Internally, we have a quality assurance (QA) process that involves unit and functional tests, bug bashes, and periodic load testing. This process usually suffices, but using your own product takes it up a notch: it involves interacting with different UX components, hitting hard-to-test edge cases, and potentially catching errors before your customers do.

With these incentives in mind, we decided to onboard ourselves onto our own product and thought that it would be a relatively simple task. It turns out it isn’t too hard, but there are some things to be careful about that can cause either immediate trouble or long-term pain, a.k.a. the gotchas of dogfooding.

Before we dive into the problems we ran into, a little introduction to our system would be helpful.

Basic system overview

We internally call Octane’s customers vendors, and our vendors’ customers are simply called customers. Our vendors want to measure their customers' fine-grained usage (in order to charge them based on it). We enable this by letting them send us an API call for each measurement.

We then store the measurements, post-process them, and let vendors deep-dive into the intricacies of their customers' usage. This enables them to create flexible price plans best suited to their usage and pricing models.

At the end of the day, we also want to charge our vendors based on usage, hence it would be perfect to make Octane a vendor on Octane.

Let’s simplify and break it down into some hypothetical assumptions:

→ Octane has a vendor called "Twilior", which we charge based on the number of events (API call - sendMeasurement()) they send us.

→ Twilior has a customer called "Ubero", which they charge based on the number of messages (API call - sendMessage()) they send Twilior.

→ Twilior uses Octane to handle all of its billing, price plan creation, and usage measurement needs. Therefore, whenever Ubero uses the sendMessage() API, Twilior handles it and also makes a call to Octane by calling sendMeasurement().

The diagram below makes this clear:
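As a rough code equivalent of this flow, here is a minimal Python sketch. The `customer` and `name` parameters, the in-memory `measurements` list, and the snake_case function names are assumptions made for illustration; they stand in for the sendMessage() and sendMeasurement() APIs described above and are not Octane's actual interface.

```python
measurements = []  # hypothetical stand-in for Octane's measurement store


def send_measurement(vendor: str, customer: str, name: str) -> None:
    """Octane's API (sketch): record one usage event for a vendor's customer."""
    measurements.append({"vendor": vendor, "customer": customer, "name": name})


def send_message(customer: str, text: str) -> str:
    """Twilior's API (sketch): handle the message, then report usage to Octane."""
    # ... Twilior delivers the message here ...
    send_measurement(vendor="Twilior", customer=customer, name="sendMessage")
    return "sent"
```

Calling `send_message("Ubero", "hello")` leaves one measurement in the store, attributed to vendor Twilior and customer Ubero.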

Gotcha #1: Infinite Loops

Now, because Octane is also a vendor on Octane and we want to charge Twilior based on the number of sendMeasurement() calls they make, we decided to call our own API. As it turns out, this leads to an infinite loop: the sendMeasurement() call that records Octane's usage is itself a measurement, which triggers another call, and so on. This seems easy to spot at first, but depending on how complex your system is and how deep in your stack the callback into your own system sits, it can be hard to detect even though it is avoidable. Overall, it should be pretty easy to get over this one by guarding the call with an if-statement.
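A minimal sketch of such a guard, in Python, might look like the following. The `OCTANE_VENDOR` constant, the `customer` and `name` parameters, and the in-memory store are hypothetical names chosen for illustration, not Octane's real code.

```python
OCTANE_VENDOR = "Octane"  # hypothetical id of Octane's own vendor record
measurements = []         # hypothetical stand-in for the measurement store


def send_measurement(vendor: str, customer: str, name: str) -> None:
    measurements.append({"vendor": vendor, "customer": customer, "name": name})
    # Guard: only record Octane's own usage for calls made by other vendors.
    # Without this check, the self-call below would trigger another self-call,
    # and so on without end.
    if vendor != OCTANE_VENDOR:
        send_measurement(vendor=OCTANE_VENDOR, customer=vendor, name="sendMeasurement")


# One call from Twilior records exactly two measurements: Twilior's own,
# plus the one Octane uses to charge Twilior.
send_measurement(vendor="Twilior", customer="Ubero", name="sendMessage")
```

The guard here sits right next to the self-call. In a deeper stack, the equivalent check may need to live wherever the callback into your own system is made.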

Gotcha #2: External facing errors for internal facing issues

Assume that, for some reason, our Octane vendor was not properly onboarded. Maybe we forgot to add some of the details needed for onboarding. As a result, the sendMeasurement(vendor = Octane) call fails because the vendor details cannot be found. This is a hypothetical scenario, but one that is easy to generalize.
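To make the scenario concrete, here is a small Python sketch. The `onboarded_vendors` set, the `VendorNotFound` exception, and the function signature are all hypothetical; the point is only that a synchronous self-call lets an internal failure surface in another vendor's request.

```python
onboarded_vendors = {"Twilior"}  # hypothetical registry; Octane's own record is missing


class VendorNotFound(Exception):
    """Raised when a measurement arrives for a vendor that was never onboarded."""


def send_measurement(vendor: str, customer: str, name: str) -> None:
    if vendor not in onboarded_vendors:
        raise VendorNotFound(vendor)
    # ... store the measurement here ...
    if vendor != "Octane":
        # Synchronous self-call: any failure here propagates to the caller.
        send_measurement(vendor="Octane", customer=vendor, name="sendMeasurement")


try:
    send_measurement(vendor="Twilior", customer="Ubero", name="sendMessage")
    outcome = "ok"
except VendorNotFound as err:
    # Twilior is onboarded correctly, yet its call fails.
    outcome = f"failed: vendor {err} not found"
```

In this sketch `outcome` ends up as `"failed: vendor Octane not found"`, even though nothing is wrong with Twilior's own setup.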

This failure could happen for a multitude of reasons, depending on how your system is designed. It means that we cause Twilior's API call (sendMeasurement(vendor = Twilior)) to fail because of an issue with the Octane vendor. This directly affects our SLAs and is also a bad customer experience. You are probably thinking, "Ok, this isn't too bad?" With a good enough testing framework, this should be avoidable in most cases. Let's see the next gotcha.

Gotcha #3: Doubling Latency
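A simplified model of a synchronous handler, written as a Python sketch, shows where the time goes. The `RTT_SECONDS` value and the assumption that the internal self-call costs the same round trip and the same handler work as an external call are illustrative choices, not measurements.

```python
RTT_SECONDS = 0.5      # hypothetical round trip time
COMPUTE_SECONDS = 2.0  # handler computation time used in this example


def request_latency(dogfooding: bool) -> float:
    """Time a vendor waits for one sendMeasurement call (simplified model)."""
    total = RTT_SECONDS + COMPUTE_SECONDS  # vendor -> Octane, plus handler work
    if dogfooding:
        # The handler synchronously calls sendMeasurement(vendor = Octane),
        # paying for another round trip and another run of the handler.
        total += RTT_SECONDS + COMPUTE_SECONDS
    return total


without_dogfooding = request_latency(dogfooding=False)  # RTT + compute
with_dogfooding = request_latency(dogfooding=True)      # 2 * (RTT + compute)
```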


The above snippet illustrates the latency problem. RTT refers to "round trip time", the network time needed to send the request and receive the response (it depends on your network quality and speed). It is important to note that RTT does not include computation time, which in this case is 2 seconds.

So, as stated, without dogfooding we would be serving our requests in less than half of the time. On top of that, as the complexity of our sendMeasurementHandler increases, we always take a 2x hit on any new latency added, which is highly undesirable. How do we go about solving this?

Enter event-based architecture

If you've never heard of the term event-based architecture, it simply means that all communication with the specified system happens asynchronously over a central, well-partitioned event bus, such as Kafka or AWS SQS.

A little system architecture secret: Behind the scenes, Octane is powered by a multitude of micro-services that communicate with each other using Kafka. We follow this pattern because it allows us to have low-latency calls while maintaining an easy-to-scale system that is resilient to micro-service crashes and network downtime.

To get around the latter two gotchas, we decided to take advantage of this async communication architecture, as seen in the code snippet below:
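What follows is a minimal Python sketch of the idea rather than Octane's actual code. It uses an in-process `queue.Queue` and a background thread as stand-ins for Kafka and a consuming micro-service. The `place_on_event_bus` helper mirrors the placeOnEventBus operation; the event shape, the `OCTANE_VENDOR` constant, and the `billed` list are assumptions.

```python
import queue
import threading

OCTANE_VENDOR = "Octane"   # hypothetical id of Octane's own vendor record
event_bus = queue.Queue()  # in-process stand-in for a Kafka topic
billed = []                # what the consumer has recorded so far


def place_on_event_bus(event: dict) -> None:
    """Enqueue and return immediately: no network call, no handler work."""
    event_bus.put(event)


def send_measurement_handler(vendor: str, customer: str, name: str) -> str:
    # ... store and post-process the vendor's measurement here ...
    if vendor != OCTANE_VENDOR:  # same if-statement guard as in Gotcha #1
        place_on_event_bus(
            {"vendor": OCTANE_VENDOR, "customer": vendor, "name": "sendMeasurement"}
        )
    return "ok"  # the vendor's response does not wait on Octane's own usage record


def consume_forever() -> None:
    """Stand-in for the consuming micro-service; errors here never reach the vendor."""
    while True:
        event = event_bus.get()
        try:
            billed.append(event)
        finally:
            event_bus.task_done()


threading.Thread(target=consume_forever, daemon=True).start()
```

Because the handler only enqueues, its response time no longer depends on how long the consumer takes, and a failure while processing Octane's own event stays inside the consumer.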


Whenever we receive a measurement, we simply place it on our async framework. Internally, this puts it on a queue that will be picked up by the event bus as necessary. Next, it will be processed by a micro-service that reads from this bus whenever it is not busy. This avoids a network call (hence removing any issues caused by network flakiness), and also makes the response completely independent of the actual computation time of the handler. Finally, we have also minimized the chance of an error, because the placeOnEventBus operation is minimal and virtually error free.

We ran basic internal experiments to verify that our assumptions were correct, and it turns out that using this architecture reduced latency significantly, as expected.

Final Thoughts

Dogfooding is a pretty good value-add for us, as it improves our QA process and also allows us to use our own product as our billing platform. If you don't use your own products, why should others? It might sound straightforward, but the goal of this article was to describe some of the potential pitfalls most people will probably run into when trying dogfooding. There are probably more complicated problems that might arise depending on how you go about it, and I would love to hear about them. Email me with thoughts, questions or problems you ran into: karan@getoctane.io. Until next time!