Strategies for data transfer to Amazon Web Services
Creating a GIS deployment with Amazon Web Services requires you to transfer some or all of your GIS data over the Internet to locations in the cloud. This topic lists options for where you can store your data in the cloud and how you can transfer it. It also discusses some factors that affect data transfer time.
Places to store the data
Once you create an EC2 instance running ArcGIS Server, you need to prepare to transfer your data to the cloud. There are several places you can store your data. All the following options incur charges from Amazon that are subject to change and that you should research before making your choice.
EBS volumes—Amazon Elastic Block Store (EBS) volumes are virtual disk drives that you can attach to your EC2 instance to add more storage. In fact, a volume is always attached for you as part of the ArcGIS Server Amazon Machine Images (AMIs). You can configure the size of this attached volume when you build the site in ArcGIS Server Cloud Builder on Amazon Web Services. The ArcGIS server directories are configured on this drive, so when you publish services with the option to copy data to the server, the data goes onto this EBS volume. You can also create other directories on this volume to hold your data.
Amazon S3—Amazon Simple Storage Service (S3) is an Amazon service designed specifically for data storage in the cloud. This storage option has the lowest potential for data failure or loss. You can use S3 as a place for data backup or as a middle ground for data transfer between your on-premises deployment and your EBS volumes. Also, any snapshots you create of your EBS volumes are stored on S3.
EC2 instance—It's possible to transfer data directly onto your EC2 instance; however, if the instance is terminated, your data from the C: or root drive will be immediately lost. The ArcGIS Server AMI apportions a relatively small amount of space (60 GB on Windows) on the C: drive to discourage data storage on this drive. In contrast, attached EBS volumes such as the D: drive persist when the instance terminates and are a safer option for data storage.
Caution: Do not store GIS data or map caches on the C: or root drive of your EC2 instance in a production deployment.
Options for transferring data to the cloud
Transferring data from your on-premises deployment into the cloud takes time and, in some cases, coordination with your IT security staff. Exporting data to a location on the Internet (in other words, the cloud) is often not as fast or secure as the common data transfers that you do within your local network.
There are many strategies you can use to get data onto the cloud. If you work with sensitive data, coordinate with your IT staff to confirm that your method is secure and approved by your organization. Following are some of your options:
Configure ArcGIS to copy the data when you publish a service—You can configure ArcGIS so that whenever you publish a service, the data for that service is copied to the server. The data is packaged into a service definition file (.sd), transferred into the ArcGIS server uploads directory, and finally unpacked into the ArcGIS server input directory or a database you have registered with ArcGIS Server (as ArcGIS Server's Managed Database). Be aware that this can take a long time and result in the transfer of large amounts of data if you do not limit the extents and datasets used in your map or other resource.
This option does not allow data to be shared between services, nor does it allow data synchronization between the cloud and your on-premises deployment.
Remote Desktop Connection copy and paste—Windows Remote Desktop Connection allows file system redirection wherein your local drives can be mapped to the remote computer. While logged into your EC2 instance on Windows through Remote Desktop, you can open Windows Explorer and copy data from your local drives to your EBS volumes.
To enable file system redirection, in the Remote Desktop Connection window, click the Local Resources tab and check the check box to make your drives available. The wording varies depending on which version of Windows you are using. In Windows 7, you have to click the More button to see the option to make drives available.
If you choose to transfer sensitive data using Remote Desktop Connection, you should ensure that additional layers of security are in place. Older versions of Remote Desktop Connection have been shown to contain security vulnerabilities wherein a computer posing as the server can gain access to your data (sometimes known as man-in-the-middle attacks).
Note: Copy and paste can take a while to transfer data. Do not copy any other file or data before the paste procedure is complete. If you do, the paste terminates and you have to start over.
S3 client utilities—Amazon S3 can be used as a middle ground for moving data from your on-premises deployment to your EBS volumes. To get data into S3, you can use the AWS Management Console or one of the many third-party apps that are designed for easily moving files between S3 and your own computers. Once your data is on S3, you can use the same utility on your EC2 instance to transfer data from S3 onto the instance.
Your own web server—Any data available on the web through HTTP is accessible to your EC2 instance. If you have a web-facing server in your organization, you can place your data on it, then download the data from your EC2 instance. The advantage of this approach is that you can configure security on your web server to limit who can download the data and to encrypt the transaction through SSL.
FTP—You can enable file transfer protocol (FTP) to upload files directly onto your EC2 instance. Beware that standard FTP does not encrypt information and sends passwords in clear text. To safely use FTP, you need to take additional security measures, such as encrypting your FTP sessions with SSL, limiting which users are allowed to transfer data to your instance through FTP, and disabling FTP after your initial data transfer. Some third-party products are designed to help you set up secure FTP connections.
AWS Import/Export—If you need to transfer an enormous amount of data to Amazon, it may be faster and/or more cost effective to ship the data to Amazon on a portable storage device and pay Amazon to load the data directly into S3. Amazon offers this service as AWS Import/Export.
If you consider using AWS Import/Export, you'll need to decide if it's appropriate for your organization's data sensitivity. Any time you put a device in the mail, you run the risk, however small, of the physical destruction or interception of your data. You can mitigate these risks by backing up and encrypting the data. If you still have concerns about whether AWS Import/Export is an appropriate choice for your data, contact Amazon directly.
Amazon works with many Solution Providers, some of whom provide data transfer, storage, and security solutions. See Find an AWS Solution Provider to understand whether one of these companies can help with your cloud strategy. Esri itself is one of these providers and offers various project and implementation services for deploying GIS in the Amazon cloud.
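As an illustration of the S3 client utilities option above, the following is a minimal sketch of a script that uploads a local data folder to an S3 bucket. It assumes the third-party boto3 library (the AWS SDK for Python) is installed and that AWS credentials are already configured on the machine. The function names, the bucket name, and the folder paths are hypothetical and are not part of ArcGIS or of any Esri tool.

```python
from pathlib import Path


def s3_key_for(local_file: Path, root: Path, prefix: str = "") -> str:
    """Build an S3 object key from a file's path relative to the upload root."""
    relative = local_file.relative_to(root).as_posix()  # S3 keys use forward slashes
    clean_prefix = prefix.strip("/")
    return f"{clean_prefix}/{relative}" if clean_prefix else relative


def upload_tree(root: Path, bucket: str, prefix: str = "") -> int:
    """Upload every file under root to the bucket; return the number of files sent."""
    # Assumption: boto3 is installed. It is imported here so that the key
    # helper above can be used without the AWS SDK present.
    import boto3

    s3 = boto3.client("s3")  # credentials come from the environment or AWS config
    count = 0
    for path in sorted(root.rglob("*")):
        if path.is_file():
            s3.upload_file(str(path), bucket, s3_key_for(path, root, prefix))
            count += 1
    return count
```

A call such as upload_tree(Path("D:/gisdata"), "my-gis-transfer-bucket", "staging") would run on the on-premises machine. On the EC2 instance, the boto3 client's download_file method can be used in a similar loop to copy the objects from S3 onto an EBS volume.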
Factors that affect data transfer time
Performance of the above data transfer options can vary based on your physical proximity to the Amazon cloud, the time of day, and the quality of your connection to the Internet.
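For planning purposes, a rough transfer time can be estimated from the dataset size and your upload bandwidth. The following sketch is only an idealized calculation. The function name and the efficiency parameter (the assumed usable fraction of your nominal bandwidth) are hypothetical, and the actual time will vary with the factors listed above.

```python
def estimate_transfer_hours(size_gb: float, uplink_mbps: float, efficiency: float = 0.8) -> float:
    """Return a rough transfer time in hours.

    size_gb     -- dataset size in gigabytes (1 GB is treated as 1000 MB)
    uplink_mbps -- nominal upload bandwidth in megabits per second
    efficiency  -- assumed usable fraction of that bandwidth (illustrative default)
    """
    if uplink_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency must be in (0, 1]")
    megabits = size_gb * 1000 * 8              # gigabytes -> megabytes -> megabits
    seconds = megabits / (uplink_mbps * efficiency)
    return seconds / 3600
```

For example, estimate_transfer_hours(100, 20) gives about 13.9 hours for a 100 GB dataset over a 20 Mbps uplink at the assumed 80 percent efficiency.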
GIS datasets, especially imagery and map caches, can take large amounts of space and may need to be zipped before transfer, either to reduce the size of the file or to reduce the total number of files for more efficient transfer (especially in the case of map caches). Some S3 client utilities may place limits on the size of any one file you can transfer or the number of individual files you can store. Also, some zipping programs have limits on the amount of data that can be zipped. The zipping time and effort should be taken into account when you choose a data transfer option.
Finally, if using S3, be aware of the limitations on the number of buckets you can create and other restrictions on S3 buckets. Amazon lists these in Bucket Restrictions and Limitations.
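The zipping step mentioned above can be scripted. The following is a minimal sketch that uses Python's standard zipfile module to pack a folder, such as a map cache, into a single archive before transfer. The function name and paths are hypothetical. Image tiles that are already compressed may not shrink much, so the main benefit in that case is a smaller number of files to transfer.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_path: Path) -> int:
    """Zip every file under source_dir into archive_path; return the file count."""
    count = 0
    # allowZip64 (the default) lets the archive and its members exceed 4 GB.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for path in sorted(source_dir.rglob("*")):
            # Skip the archive itself in case it was created inside source_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                # Store paths relative to source_dir so the folder layout is
                # preserved when the archive is unpacked on the instance.
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
                count += 1
    return count
```

A call such as zip_directory(Path("D:/arcgisserver/cache/MyMap"), Path("D:/transfer/MyMap.zip")) would produce one file to upload. Both paths are illustrative only.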
Maintaining the integrity of data paths
Any time you move data to a new location, you need to be aware of any paths referencing the data that may also need to be updated. This is a concern with map documents, which may reference dozens of data layers at different paths.
Registering your Amazon EC2 data location with your ArcGIS server can help reduce the effort of fixing broken data paths after publishing. See Registering your data with ArcGIS Server using ArcGIS for Desktop.
Another option is to log in to your instance and use ArcMap to repair the out-of-date paths. ArcGIS for Desktop is included on the ArcGIS Server AMI so that you can easily make the repairs. See Repairing broken data links to learn about updating data path information in a map document.
Another way to reduce the need to repair data connections is to use relative paths in your map documents and store your maps and data in a common folder.
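To see why relative paths survive a move, consider the following sketch. It uses Python's standard os.path module to express a dataset location relative to the folder that holds the map document. This is only an illustration of the concept, not how ArcMap stores paths internally, and the function name and file names are hypothetical.

```python
import os.path


def relative_data_path(map_document: str, dataset: str) -> str:
    """Return the dataset path relative to the folder containing the map document."""
    return os.path.relpath(dataset, start=os.path.dirname(map_document))
```

With a map at /project/maps/city.mxd and data at /project/data/roads.shp, the function returns ../data/roads.shp on a system that uses POSIX-style paths. The result is the same if the whole project folder is copied to another location, which is why a map that references its data this way does not need repair after the transfer.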