Backing up all of our data to two different locations was a requirement from the beginning at BitLeap, even before there was much of a product to speak of. In an effort to save bandwidth at the customer’s location, the data is transferred from the LeapServ to a single BitLeap location which is then responsible for transferring it to a second location.
Our first approach to the two-location problem was to receive the data from the customer and immediately transfer it to the second location before letting the customer continue. This method worked well for a while but eventually started to show its shortcomings. So many systems were involved in acknowledging a single piece of data that one small, isolated problem could affect the entire system. It also led to performance issues, because the overhead of receiving a single piece of data was so high.
The first logical step we devised was to break apart the process that receives data from the customer and the process that sends it to the second location. As we suspected, separating them solved the performance issues in receiving data from the customer, but it worked almost a little too well, as we quickly realized. Data could now flow into our servers from customers much faster than before. Combined with the newly separated process transferring all of that data to the second location, this caused processor and bandwidth usage on our servers to jump significantly.
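To make the split concrete, here is a minimal sketch of the idea in Python. It is not our production code: the in-memory dictionaries stand in for the two locations, and every name in it (receive_chunk, replication_worker, replication_queue) is hypothetical. The point is only that the acknowledgment to the customer no longer waits on the second location.

```python
import queue
import threading

# Hypothetical in-memory stand-ins for the two storage locations.
primary_store = {}
secondary_store = {}

# Chunk ids waiting to be copied to the second location.
replication_queue = queue.Queue()


def receive_chunk(chunk_id, data):
    """Store a chunk locally, queue it for replication, and acknowledge."""
    primary_store[chunk_id] = data
    replication_queue.put(chunk_id)
    # The customer can continue without waiting on the second location.
    return True


def replication_worker():
    """Copy queued chunks to the second location, independent of receiving."""
    while True:
        chunk_id = replication_queue.get()
        if chunk_id is None:  # sentinel value: stop the worker
            replication_queue.task_done()
            break
        secondary_store[chunk_id] = primary_store[chunk_id]
        replication_queue.task_done()


def start_worker():
    """Start the background replication thread and return it."""
    worker = threading.Thread(target=replication_worker, daemon=True)
    worker.start()
    return worker
```

A failure in the worker now only delays replication; it does not block receive_chunk. The trade-off is the one described above: nothing slows the receiver down anymore, so both halves compete for the same processor and bandwidth.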
One thing to realize about offsite backup traffic is that most of it comes in during the night, when our customers are not at work. This means that during the day the processor and bandwidth utilization of our servers is relatively low. Because of this, we decided to accept all of the backup data from customers during the night and only transfer it to the second location during the day, to even out the load somewhat. After making this change, it seemed as if we had hit a sweet spot between quickly accepting backup data from customers and sending it to the second offsite location.
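A rough sketch of that day-only rule, again in Python with hypothetical names. The 08:00 to 18:00 window is an assumption made purely for illustration; the actual hours are not stated here.

```python
from collections import deque
from datetime import time

# Assumed window for illustration only; the real hours may differ.
TRANSFER_START = time(8, 0)
TRANSFER_END = time(18, 0)


def in_transfer_window(now, start=TRANSFER_START, end=TRANSFER_END):
    """Return True if `now` (a datetime.time) falls inside [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


def drain_if_allowed(pending, send, now):
    """Send queued chunk ids only inside the window; return how many were sent."""
    sent = 0
    while pending and in_transfer_window(now):
        send(pending.popleft())
        sent += 1
    return sent
```

Outside the window, drain_if_allowed leaves the queue untouched, so anything received overnight simply waits for the day.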
After running with the new approach for a while, we found that the queues used to transfer data between offsite locations were prone to backing up if there was any interruption in the transfer process during the day. While it seemed like the solution was to add more hours to the day, we couldn’t help but notice that there was still quite a bit of unused inbound bandwidth scattered throughout the night at each location that could be used for transfers. This is where the fun of traffic shaping comes in.
Stay tuned next time for the shocking conclusion!