Cloud OnRamp

Three Pitfalls to Cloud On-Ramping

You’ve decided that you’re going to move to an external cloud platform. You’ve provisioned the platform with the applications you need, and you’re ready to go. Except for the few quazillion bytes of existing data that are sitting in your existing data center and need to be moved over to the new platform.

A while back, I discussed how using a logistics company and a physical storage device is probably the best way to get large amounts of data onto an external cloud platform. But what about just using the internet to do this? Can it be done effectively?

1) Volume. The first issue is just the volume of data involved and the bandwidth available for moving it. The first task, therefore, is to minimize the amount of data that needs to be transferred. Data cleansing followed by deduplication can reduce volumes by up to 80%, and transfer times will be proportionately shorter. Data compression, packet shaping, and other network acceleration techniques can also help reduce the time required to move the data from the old data center to the cloud. However, even a reduced data set still has to cross the network, which brings us on to point two.
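To put rough numbers on the volume point, here is a minimal sketch of the arithmetic. The function name `transfer_days`, the 1 Gbit/s link, and the 10 TB starting volume are illustrative assumptions, not measurements; it also assumes the whole link is available and ignores protocol overhead.

```python
def transfer_days(volume_tb: float, link_mbps: float, reduction: float = 0.0) -> float:
    """Estimate the days needed to move volume_tb terabytes over a link.

    link_mbps is the usable bandwidth in megabits per second.
    reduction is the fraction of data removed beforehand by cleansing and
    deduplication (0.8 means 80% removed).
    Decimal units are assumed (1 TB = 10**12 bytes); overhead is ignored.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    bits_to_send = volume_tb * (1.0 - reduction) * 1e12 * 8
    seconds = bits_to_send / (link_mbps * 1e6)
    return seconds / 86_400  # seconds in a day


# Hypothetical figures: 10 TB over a 1 Gbit/s link, before and after an 80% reduction.
before = transfer_days(10, 1000)
after = transfer_days(10, 1000, reduction=0.8)
print(f"{before:.2f} days before reduction, {after:.2f} days after")
```

With these assumed figures the estimate drops from a little under a day to under five hours, which is the "proportionately shorter" effect described above.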

2) Network Issues. Even a few terabytes would take a long time to move over anything other than a high-speed link dedicated to the organization, and most organizations would be sharing this link with all their other internet traffic. Therefore, it is important to ensure that the correct priorities are applied to the various types of data being transferred across the connection. Voice and video will probably already be receiving high priority, with most other data being allocated a “standard” class of transport. The temptation is to treat the bulk transfer like a backup and allocate it a low priority. This will mean that the data only transfers when there is little to nothing else happening, and the transfer times will be extended.

So, if possible, the best option is a dedicated physical link. If not, use a dedicated virtual portion of the connection so that you can calculate pretty accurately how long the transfer is going to take. Only as a last resort should the data be transferred over the same shared connection as your organization’s other data. Why? Because of point 3, synchronization.
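The calculation can also be turned around: given a deadline, how much bandwidth would have to be reserved? The sketch below is only illustrative. The function name `required_mbps`, the 2 TB volume, the one-day deadline, and the 1 Gbit/s shared link are all assumptions, and protocol overhead is ignored.

```python
def required_mbps(volume_tb: float, deadline_days: float) -> float:
    """Bandwidth in megabits per second needed to move volume_tb in deadline_days.

    Decimal units are assumed (1 TB = 10**12 bytes); overhead is ignored.
    """
    if deadline_days <= 0:
        raise ValueError("deadline_days must be positive")
    bits_to_send = volume_tb * 1e12 * 8
    return bits_to_send / (deadline_days * 86_400) / 1e6


# Hypothetical case: 2 TB must arrive within one day over a shared 1 Gbit/s link.
link_mbps = 1000
need = required_mbps(2, 1)
share = need / link_mbps
print(f"Reserve about {need:.0f} Mbit/s, or {share:.0%} of the link")
```

If the reserved share is guaranteed, the estimate holds regardless of what the rest of the organization is doing on the link. On a low-priority shared class there is no such figure to plan around.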

3) Synchronization. Even when the data transfer is happening as fast as possible, it still takes a finite amount of time.  During this time, the data at the source continues to change — new transactions take place, new data is created.  Therefore, what you end up with at the cloud side is not the same as what you have at the source side, which is a bit of a problem.

There are a couple of ways around this. The unlikely one is to shut down the application that is creating the data for as long as the transfer takes. This has the obvious, and unwelcome, side effect of halting business. No, the main way of dealing with this is to iterate the transfer at a delta level: after the initial bulk copy, repeatedly transfer only what has changed since the previous pass.

Assume that you start with 10TB of data, with each day creating 1% more data (so, 100GB). Assume that the original transfer of data to the cloud takes 1 day. The original transfer therefore moves the original 10TB, but this leaves you with yesterday’s data, not today’s. You now have 100GB of delta data to transfer across. Per byte, this is not as fast a task as the original transfer, because comparisons have to be carried out between what is already there and what has changed. For the sake of argument, let’s say that this 1% extra data takes 10% of the time to move across, or about 2.4 hours.

Now, we have moved closer to our end result. However, the source data has changed again: the 2.4 hours or so (10% of a day) that it took to move the delta data across has resulted in a further 10GB of new data being created.

The iterations could go on for a long time, but at a 10GB level, you’re probably at the point where the last synchronization of data can be carried out during the planned switchover of the application, with minimum impact to the business. The key is to plan how many iterations will be required to get to a point where a final synchronization can be done easily.
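The iteration planning above can be sketched as a small simulation. This is a rough model, not a sizing tool: the function name `plan_iterations` and its parameters are hypothetical, the defaults are the figures from the worked example, and it assumes new data arrives at a constant rate and that delta passes run at a fixed fraction of the bulk transfer speed.

```python
def plan_iterations(initial_tb=10.0, daily_growth=0.01, bulk_days=1.0,
                    delta_slowdown=10.0, cutover_gb=10.0):
    """Simulate bulk copy plus delta passes until the backlog fits the cutover.

    Returns (passes, backlog_gb), where passes is a list of
    (pass_number, size_gb, duration_hours) and backlog_gb is the data left
    for the final synchronization during switchover.
    """
    initial_gb = initial_tb * 1000
    growth_gb_per_day = initial_gb * daily_growth  # held constant (assumption)
    bulk_rate = initial_gb / bulk_days             # GB per day for the bulk copy
    delta_rate = bulk_rate / delta_slowdown        # deltas are slower per byte
    if cutover_gb <= 0:
        raise ValueError("cutover_gb must be positive")
    if growth_gb_per_day >= delta_rate:
        raise ValueError("data grows faster than deltas move; never converges")

    passes = []
    number, size_gb, days = 0, initial_gb, bulk_days
    while True:
        passes.append((number, size_gb, days * 24))
        backlog_gb = growth_gb_per_day * days      # created while this pass ran
        if backlog_gb <= cutover_gb:
            return passes, backlog_gb
        number += 1
        size_gb = backlog_gb
        days = backlog_gb / delta_rate


passes, backlog = plan_iterations()
for number, size_gb, hours in passes:
    print(f"pass {number}: {size_gb:.0f} GB in {hours:.1f} hours")
print(f"left for final synchronization: {backlog:.0f} GB")
```

With the figures from the example this gives two passes (the 10 TB bulk copy in 24 hours, then a 100 GB delta in 2.4 hours) and leaves about 10 GB for the final synchronization. The convergence check matters: if data is created faster than the delta passes can move it, no number of iterations will close the gap.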

So, there are three main areas to deal with: data volume can be dealt with through cleansing, data deduplication, and wide area network optimization technologies; network issues can be dealt with through virtual data links and/or prioritization; and data synchronization can be handled through planned iterations of synchronization, followed by a final off-line synchronization.

Or you could use a logistics company…