
Amazon Web Services

Amazon Web Services. The Cloud. A real cloud. Not really just vapor anymore :)


AWS Setup

aws configure			# initial config with user credentials, preferred region, etc
scp -pr .aws remotehost: 	# duplicate config to another admin client.  dir/files should not be world readable
env | grep AWS_			# about 5 env vars, including AWS_SECRET_ACCESS_KEY... feels less secure than having the .aws/ config dir above.

# example .aws/config.  The equivalent env vars can set these instead.
[default]
output = text
region = us-west-2

# example .aws/credentials.  The equivalent env vars can set these too, but that is INSECURE.
[default]
aws_access_key_id = AKamaiIker53rdStAlph
aws_secret_access_key = /LYSquieremuchoWanhWahnFAKE123xfake123Ti

# awscli v1.2.9 (on ubuntu) does not recognize this separate file; there the content is stored in the same config file above.
# The secret_access_key is very important and should be kept very private!!  (so don't put it in an env var!)
# The AWS_ACCESS_KEY_ID can be obtained from the Web Console, but the secret access key can only be retrieved when it is first generated.
# Use IAM users instead of the root account.  Each user can have 2 key_id/secret pairs at a time.
# One annoying thing is that the ids cannot be named to help remember which computer has used them for aws config.
# They can be copied and used on multiple computers, but the .aws dir must be in some safe place!!
# Password-protected ssh/pem files feel like much better security, but cannot be used with the awscli :(
# env var name
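
The notes above say the .aws dir and its files should not be world readable. A minimal sketch of locking it down, assuming a Linux-style chmod and find; secure_aws_dir is a hypothetical helper name, not part of the awscli.

```shell
# Restrict an AWS CLI config directory to its owner only.
# secure_aws_dir is a made-up helper name; default path is the usual ~/.aws.
secure_aws_dir() {
    dir="${1:-$HOME/.aws}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    chmod 700 "$dir"                           # owner-only access to the directory
    find "$dir" -type f -exec chmod 600 {} +   # owner-only read/write on each file
}
```

Run `secure_aws_dir` after `aws configure`, or `secure_aws_dir /path/to/copied/.aws` on the host that received the scp'd copy.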

aws configure --profile user2	# create additional profile (eg personal vs work)

complete -C $(which aws_completer) aws		# allows for TAB completion of aws sub commands

# install pip, can be done with windows cygwin's python
sudo python
pip --help

# install aws cli once pip is in place   (Anaconda on Windows comes with pip and can install this successfully)
sudo pip install awscli

sudo apt-get install awscli		# ubuntu, mint now have .deb for the python package

EC2 Commands

aws ec2 describe-regions --output=text
aws ec2 describe-subnets --output=table

aws --region us-west-2 --output=table ec2 describe-instances | egrep  '(Value|PrivateIp|\ Name)'

aws --region us-east-1 ec2 describe-subnets --query 'Subnets[*].[SubnetId,CidrBlock,AvailabilityZone,Tags[?Key==`Name`] | [0].Value]' --filters "Name=vpc-id,Values=vpc-abcd1234" "Name=tag-value,Values=\*HPC\*"

for region in $(aws ec2 describe-regions --query 'Regions[*].[RegionName]' --output text); do echo $region; aws ec2 import-key-pair --region $region --key-name tin6150 --public-key-material "$(cat $HOME/.ssh/" ; done

aws ec2 authorize-security-group-ingress --group-id sg-903004f8 --protocol tcp --port 22 --cidr 24

aws ec2 describe-security-groups 

aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr	# allow port 22 inbound traffic

aws ec2 create-tags --resources i-xxxxxxxx --tags Key=MyNAME,Value=MyInstance    # add a name tag to my instance

aws ec2 describe-instances	# list all instances and their info
aws ec2 describe-instances --instance-ids i-30d27590 --output=table
aws ec2 stop-instances     --instance-ids i-30d27590
aws ec2 start-instances    --instance-ids i-30d27590
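
The stop/start commands above can be wrapped for several instances at once. This is only a sketch: ec2_power and the AWS_CLI override variable are hypothetical names introduced here, and setting AWS_CLI=echo gives a dry run that prints the command instead of calling AWS.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# ec2_power ACTION INSTANCE_ID...   where ACTION is "start" or "stop"
ec2_power() {
    action="$1"; shift
    case "$action" in
        start|stop) ;;
        *) echo "usage: ec2_power start|stop INSTANCE_ID..." >&2; return 2 ;;
    esac
    [ "$#" -gt 0 ] || { echo "no instance ids given" >&2; return 2; }
    # --instance-ids accepts a space separated list of ids
    "$AWS_CLI" ec2 "${action}-instances" --instance-ids "$@"
}
```

eg `ec2_power stop i-30d27590` at the end of the day, `ec2_power start i-30d27590` the next morning.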

Finding instance info from within the VM

lspci		# should see something like
		# 00:03.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

ec2-metadata -i								# Return the instance id of the current VM (AWS Linux)
ec2metadata --instance-id						# ubuntu w/ cloud-utils
wget -q -O -   http://instance-data/latest/meta-data/instance-id	# 
wget -q -O -	# is IANA local link addr when NO DHCP address is received.  

EC2 Pricing

  1. Size is the best indicator of pricing: m4.xlarge and c4.xlarge are about the same price, and both are more expensive than *.large.
  2. VM instances are categorized as T, M, C, G, R, which can be thought of as Tiny, Moderate, CPU, GPU, RAM. Tiny is cheaper than Moderate. Note that the R type starts at large, so there is no "small" pricing for this category.
  3. VMs with lots of storage are of type I, D.
  4. The number after the category is the generation number, eg m3.medium uses a newer CPU than m1.medium and is thus slightly more expensive.
  5. OS type matters (due to license?). The list below is in increasing order of price for the same VM type (eg for m4.large, 2015.09, N.Virginia):
    1. Linux ($0.126/hr) [CentOS? Amazon Linux? No OS license fee]
    2. RHEL ($0.186/hr = $134/mo)
    3. SLES ($0.226/hr)
    4. Windows ($0.252/hr)
    5. Win w/ SQL web ($0.261/hr)
    6. Win w/ SQL std ($0.927/hr)
    7. Win w/ SQL ent (about 2x of SQL std)
  6. Looking at free software in the Amazon Marketplace is an easy way to see prices for all the different instance types, eg the NCBI Blast AMI.
Source: EC2 price list
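
The per-month figure quoted above ($0.186/hr = $134/mo) appears to use 720 hours per month (30 days x 24 hr); that is an inference from the numbers, not something the price list states. A small sketch with a hypothetical helper, monthly_cost, for converting the hourly rates in the list:

```shell
# monthly_cost HOURLY_RATE [HOURS_PER_MONTH]
# Default of 720 hours is an assumption inferred from the figures in these notes.
monthly_cost() {
    awk -v rate="$1" -v hours="${2:-720}" 'BEGIN { printf "%.2f\n", rate * hours }'
}

monthly_cost 0.126   # m4.large Linux   -> 90.72
monthly_cost 0.186   # m4.large RHEL    -> 133.92
monthly_cost 0.252   # m4.large Windows -> 181.44
```

Pass a second argument, eg `monthly_cost 0.126 744`, to price a 31-day month.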

EC2 vs Google Compute Cloud pricing

Hard to do an apples-to-apples comparison, especially once spot/preemptible instance prices get involved.

AWS    bills in 1 hr  increments.
Google bills in 1 min increments, min 10 min.  Automatic discount for sustained use (seems to be 24%).

Prices are for the barebone VM; additional charges apply for OS license cost, app license, etc.
Comparison done on Nov 8, 2015.

aws/google specs (cpu, RAM, default disk)    aws instance name and price          google instance name and price

1 cpu, 1.0/0.6 GB RAM, 8/10 GB disk          t2.micro   $0.013/hr ($   9.36/mo)   f1-micro  $0.008/hr ($   5.76/mo)
1 cpu, 2.0/1.7 GB RAM                        t2.small   $0.026/hr ($  18.72/mo)   g1-small  $0.027/hr

2 cpu, 8.0/7.5 GB RAM                        m4.large   $0.126/hr ($  90/mo)      n1-std-2  $0.100/hr ($  72/mo)
32cpu, 108/120 GB RAM                        c3.8xlarge $1.680/hr ($1209/mo)      n1-std-32 $1.600/hr ($1152/mo)

Additional charges:

Google: $0.40 for each 10 GB of persistent disk, per month, charged even when the VM is not running.
AWS:    $0.12 for each  1 GB of persistent disk, per month, charged even when the VM is not running.

AWS has an IOPS limitation for EBS disks.
No inbound/outbound data charges seen so far.  Not sure if S3 has such charges.
VPN could be a separate charge.
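
The billing increments noted above (AWS per hour, Google per minute with a 10 minute minimum) matter most for short jobs. A rough sketch of the arithmetic, assuming whole minutes and ignoring the sustained-use discount, disk, and license charges; both function names are made up for this example.

```shell
# aws_hourly_bill MINUTES RATE_PER_HR : usage is rounded up to whole hours
aws_hourly_bill() {
    awk -v m="$1" -v rate="$2" 'BEGIN {
        h = int(m / 60); if (h * 60 < m) h++    # ceiling of m/60
        printf "%.4f\n", h * rate }'
}

# google_minute_bill MINUTES RATE_PER_HR : per-minute billing, 10 minute minimum
google_minute_bill() {
    awk -v m="$1" -v rate="$2" 'BEGIN {
        if (m < 10) m = 10                      # minimum charge is 10 minutes
        printf "%.4f\n", m / 60 * rate }'
}

aws_hourly_bill    25 0.126   # m4.large for 25 min -> 0.1260
google_minute_bill 25 0.100   # n1-std-2 for 25 min -> 0.0417
```

For a job that runs a full hour or more the two models converge, and the hourly rate dominates.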

Storage for EC2

  1. EBS: Elastic Block Store. Think of this as the virtual hard drive used in VMware. Storage is attached to a specific EC2 instance (VM). EBS storage does not automatically go away when an instance is terminated, and can be manually attached to another instance if required. $ 0.10 / GB / month
  2. EFS: Elastic File System. Provides NFSv4 support (v3?), and thus file access from multiple instances. $ 0.30 / GB-month, calculated from GB per day, summed over the month.
  3. SoftNAS: AWS marketplace vendor providing high-performance cloud NAS, up to 20 TB. NFS, CIFS, iSCSI. HA when deploying the 2 requisite instances. Implemented on EBS, and needs EC2 to host its software. So cost can range from $ 0.01/hr to $ 5.28/hr + cost of EBS storage.
  4. Glacier: for data backup and archiving, extremely low cost. $ 0.007 / GB / month + cost of xfer out ( $ 0.09 / GB )
    Stores a .tar or .zip as an immutable archive. Each one is assigned an archive ID.
  5. S3: Simple Storage Service. This is an object store. It provides a web interface to access a given object; no file system interface is provided.
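
To compare the per-GB rates in the list above for a given data size, a quick sketch. The rates are the ones quoted in these notes (2015 era, treated as per GB per month), and storage_cost is a hypothetical helper name; transfer-out and request fees are not included.

```shell
# storage_cost TYPE GB : rough monthly storage cost using the rates quoted in these notes
storage_cost() {
    case "$1" in
        ebs)     rate=0.10  ;;   # $ per GB per month
        efs)     rate=0.30  ;;
        glacier) rate=0.007 ;;   # storage only, xfer out is extra
        *) echo "unknown storage type: $1" >&2; return 2 ;;
    esac
    awk -v gb="$2" -v rate="$rate" 'BEGIN { printf "%.2f\n", gb * rate }'
}

storage_cost ebs     500   # -> 50.00
storage_cost efs     500   # -> 150.00
storage_cost glacier 500   # -> 3.50
```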


Elastic File System - in Beta as of 2015.11
  1. Secure access within VPC.



    Ephemeral storage

    1. *NOT Persistent!!* Files saved will be gone once the EC2 instance is stopped or terminated (they do survive a plain reboot).
    2. Physically attached to the EC2 host, so it does behave like a virtual hard disk. Often mounted as /media/ephemeral0.
    3. It is free, but only comes with the larger instances, with size increasing along with instance size.
    4. Better performance than EBS.
    5. Ideal as the root drive of an HPC cluster node, where no persistent storage is needed (AMI is copied to ephemeral disk on boot?).
    6. Ephemeral storage is considered instance storage. To use it as a boot device, it must be set up before the host is created, by using a device mapping such as /dev/sdc=ephemeral0.
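
The device mapping in the last item can be given at launch time with the --block-device-mappings option of run-instances. A hedged sketch: launch_with_ephemeral and the AWS_CLI override are hypothetical names, the AMI id is a placeholder, and the instance type must be one that actually offers instance store.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# launch_with_ephemeral AMI_ID INSTANCE_TYPE
# Maps the first instance store volume (ephemeral0) to /dev/sdc at launch.
launch_with_ephemeral() {
    "$AWS_CLI" ec2 run-instances \
        --image-id "$1" \
        --instance-type "$2" \
        --block-device-mappings '[{"DeviceName":"/dev/sdc","VirtualName":"ephemeral0"}]'
}
```

eg `AWS_CLI=echo launch_with_ephemeral ami-xxxxxxxx m3.medium` to check the command line before launching for real.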

    Direct-attached storage

    1. Not to be confused with physically-attached storage, which is what EBS is.
    2. It is native to a specific EC2 instance. Likely a hard drive on the same physical server hosting the EC2 instance. As such, it cannot be mounted on a different EC2 instance!
    3. Not shared, so potentially/likely better performance than EBS, with less variance in performance.
    4. Offered on beefier EC2 instances only. May be more worthwhile than paying for PIOPS EBS.
    5. There *is* a SPOF in direct-attached storage.
    6. Persistent data (should be, double check).


    EBS

    1. EBS emulates a virtual hard drive, so it is mounted by a specific EC2 instance for use, but it lives independently of any given EC2 instance. ie, it is NOT instance storage, and persists after an EC2 instance is terminated (deleted).
    2. Only mountable within the same availability zone. But then it has no replication delay.
    3. Has two tiers: Standard IOPS and Provisioned IOPS (PIOPS). The latter allows extra payment to get dedicated performance. It is not necessarily faster, but will be more predictable (less variable, less likely to have bad performance because another instance is sharing the hardware and hitting it hard).
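
Because an EBS volume lives independently of any instance, it is created and attached as separate steps, and the volume has to be in the same zone as the instance it will attach to. A minimal sketch; ebs_create, ebs_attach and the AWS_CLI override are hypothetical names, and gp2 is just one volume type chosen for the example.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# ebs_create SIZE_GIB ZONE          eg: ebs_create 100 us-west-2a
ebs_create() {
    "$AWS_CLI" ec2 create-volume --size "$1" --availability-zone "$2" --volume-type gp2
}

# ebs_attach VOLUME_ID INSTANCE_ID DEVICE   eg: ebs_attach vol-xxxxxxxx i-30d27590 /dev/sdf
ebs_attach() {
    "$AWS_CLI" ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
}
```

The volume id passed to ebs_attach is the one reported in the create-volume output.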


    S3

    1. Web-centric way to access files. Main use case is a programmer coding an app to access files/objects using the AWS S3 API.
    2. It does not emulate a virtual hard drive as EBS does.
    3. Files in S3 are accessible from any AWS Region. They can also be replicated for availability and performance.
    4. Works on an "eventually consistent" model. Has replication delay.


    Glacier

    1. Glacier is like S3, but much slower and much cheaper. It has an upload/download charge structure.
    2. Intended for archival. It may take hours for files to be fetched (likely from tape) before they are usable.

    Using S3 to serve static-content web site

    S3 can be used to host a web site that does not need to serve server-side dynamic content. It is well documented, see overview and Bucket config.
    Be forewarned that each little file retrieval adds to the cost. A web site may have very many little files, so this cost may add up!
    • Create a bucket. eg tin6150.
    • Use standard storage, not "infrequent access storage" or "glacier storage", as the access surcharges on the latter are expensive!
    • Upload files to the bucket, and in the upload details set permissions to "Make everything public". This just means the file's properties will say Grantee: Everyone to open/download, but not edit. View Permissions apparently is not needed.
    • Set the bucket property to enable web hosting. This will generate a website endpoint based on the region the bucket is in, eg:
    • Upload will overwrite old files w/o warning. It will set new permissions as per the latest upload.
    • S3 Pricing details
      • Storage cost isn't bad.
      • GET requests are substantially cheaper than POST and PUT requests
      • Upload to AWS (xfer IN) is free
      • xfer OUT (ie, visitor retrieving files to see the web site) has a per GB data xfer OUT fee. This is in ADDITION to the GET or POST request fee. Like the Hotel California song, you can check out but you can never leave! :-P
    • Uploading a folder works, no need to pre-create a folder.
    • There are options to set a DNS domain to point to the S3 web site, see AWS Route 53 or even a DNS CNAME to the S3 endpoint, eg
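
The "enable web hosting" step above can also be done from the command line with the s3 website subcommand, instead of the web console. A sketch only: s3_enable_website and the AWS_CLI override are hypothetical names, and index.html / error.html are assumed file names for the index and error pages.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# s3_enable_website BUCKET : turn on static web hosting for an existing bucket
s3_enable_website() {
    "$AWS_CLI" s3 website "s3://$1/" \
        --index-document index.html \
        --error-document error.html
}
```

eg `s3_enable_website tin6150` once the files have been uploaded with public-read permission.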
    S3 commands
    # S3 buckets are accessible globally, so while hosted in a region, I/O commands can work w/o specifying any region.
    aws s3 ls					# list buckets
    aws s3 ls sn-s3-bucket-oregon-webhosting	# list content of bucket named "sn-s3-bucket-oregon-webhosting"		## cat@grumpy
    aws s3 ls sn-s3-bucket-oregon-webhosting/fig/	# it is more like "ls -ld", add trailing / to see content inside a dir
    aws s3 ls s3://sn-s3-bucket-oregon-webhosting	# prefixing bucket name with s3:// is req with older awscli
    aws s3 ls s3://sapsg				# t6@g 
    aws s3 ls s3://tin6150				# t6@g
    aws s3 ls s3://ask-margo			# t6@g
    aws s3 ls s3://nibr				# cat@grump
    aws s3 sync . s3://tin6150  --acl public-read	# sync is like rsync, skip files already in destination
    aws s3 sync . s3://sapsg    --acl public-read	# xfer-in is free, so okay to test upload to s3 like this :)
    aws s3 sync conf     s3://nibr/conf   --acl public-read		# conf is name of a dir in this eg
    aws s3 sync conf     s3://ask-margo   --acl public-read		# the dirname must be stated in the destination too, or all files in the conf/* dir will be placed one level higher!
    aws s3 sync conf/    s3://nibr        --acl public-read		# same result whether or not a trailing / is added to explicitly state that src is a dir.
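
Since sync copies the content of the source dir rather than the dir itself, it is easy to forget the dir name in the destination. A small sketch that always appends it; s3_sync_dir and the AWS_CLI override are hypothetical names. Extra arguments are passed through, so the real --dryrun option of sync can be used to preview what would be uploaded.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# s3_sync_dir LOCAL_DIR BUCKET [extra sync options...]
# Syncs LOCAL_DIR to s3://BUCKET/<name of LOCAL_DIR>, with or without a trailing / on LOCAL_DIR.
s3_sync_dir() {
    src="${1%/}"; bucket="$2"; shift 2
    "$AWS_CLI" s3 sync "$src" "s3://$bucket/$(basename "$src")" --acl public-read "$@"
}
```

eg `s3_sync_dir conf/ nibr --dryrun` to preview, then `s3_sync_dir conf/ nibr` to upload into s3://nibr/conf.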

    ref: S3 commands

    HPC in EC2

    MIT StarCluster

    StarCluster from MIT provides an easy way to create (and terminate) an SGE cluster running on AWS EC2. Characteristics:
    • The AMI is based on Ubuntu 13.04 (as of 2015.12).
    • Utilizes Open Grid Scheduler (OGS, a fork of SGE) and Condor for workload management.
    • Programming environment includes SciPy, NumPy, IPython, CUDA, PyCuda, PyOpenCL, OpenBLAS...
    • Provides OpenMPI, Hadoop.
    • A cluster-wide NFS-mounted FS. (An additional EBS volume needs to be defined in the config, mounted by the master node)
    • IAM "EC2 Full Access" should be granted to the user that needs to create the nodes that form the starcluster. ref:

    StarCluster setup

    pip install starcluster
    starcluster help
    starcluster --region us-west-2 listpublic 	# list avail AMI
    starcluster createkey -o ~/.ssh/mycluster.rsa  mycluster
    	# the public key is not returned by the above command
    	# alt, can use -i option to import pre-generated ssh keys.
    	# that key has to be imported into AWS IAM key pairs, or else get strange error about key does not exist in region us-east-1
    # generate starcluster configfile, 
    # edit ~/.starcluster/config with new key
    # AWS info, NODE_IMAGE_ID with ami id in the desired region, NODE_INSTANCE_TYPE.
    starcluster start -s 2 mycluster	# create and start a new cluster named "mycluster".  config read from ~/.starcluster/config
    					# in the default config, the master node is also an sge exec host
    					# nodes use a single NIC/IP, no distinction b/w private and public network.
    					# EC2 allocates a public IP for each node by default.
    starcluster listclusters
    starcluster restart    mycluster	# reboot all nodes.
    starcluster terminate  mycluster	# terminate the instances, stop paying for them.  #EBS remains?
    starcluster stop       mycluster	# only poweroff nodes, preserving EBS image (/mnt ephemeral storage will still be lost, of course!)
    starcluster start -x   mycluster	# restart stopped cluster, all nodes will come back.
    starcluster sshmaster  mycluster		# login to master node as root
    starcluster sshmaster  mycluster -u sgeadmin	# login to master node as sgeadmin, can issue typical qconf commands from there.
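
The config edits mentioned above (key, AWS info, NODE_IMAGE_ID, NODE_INSTANCE_TYPE) could look roughly like the template below. This is a sketch from memory of the StarCluster config layout, so verify the section and setting names against the template that `starcluster help` offers to generate; the values in angle brackets are placeholders, m3.medium and the smallcluster template name are arbitrary choices, and write_starcluster_config is a hypothetical helper.

```shell
# write_starcluster_config OUTPUT_PATH : write a starter config to edit by hand
write_starcluster_config() {
    cat > "$1" <<'EOF'
[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
AWS_ACCESS_KEY_ID = <access key id>
AWS_SECRET_ACCESS_KEY = <secret access key>
AWS_USER_ID = <aws account number>
AWS_REGION_NAME = us-west-2
AWS_REGION_HOST = ec2.us-west-2.amazonaws.com

[key mycluster]
KEY_LOCATION = ~/.ssh/mycluster.rsa

[cluster smallcluster]
KEYNAME = mycluster
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
NODE_IMAGE_ID = <ami id from listpublic, in the same region>
NODE_INSTANCE_TYPE = m3.medium
EOF
    chmod 600 "$1"   # the file holds the secret key, keep it owner-only
}
```

Write it to a scratch path first, eg `write_starcluster_config /tmp/starcluster.config`, and merge by hand into ~/.starcluster/config so an existing config is not overwritten.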

    1. Admin mag article
    2. AWS Marketplace page

    AWS Batch

    • AWS Batch is modeled after HPC batch job scheduling. Perhaps more like HTC than HPC.
    • No additional cost or special pricing, just pay for the EC2 instances needed to run the jobs.
    • No need to set up a server to run the batch scheduler (so no need to pay for an extra EC2 server to host such management (?))
    • The jobs it runs are docker container jobs. So, maybe a kubernetes thing...
    • Batch has a manager to help bid for spot instances.
    • Could scale a job wide for fewer hours, rather than an in-house HPC cluster that is a static size and runs for days.
    • Resources are scaled up automatically to satisfy jobs, and scaled down when the number of runnable jobs decreases. One still has to manage min, desired, max vCPU.
    • ECS Agent is used to run containerized jobs.
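
Submitting a container job to Batch from the command line can be sketched as below. It assumes a job queue and a job definition already exist; their names in the usage line are placeholders, and batch_submit and the AWS_CLI override are hypothetical names.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # hypothetical override; AWS_CLI=echo prints instead of running

# batch_submit JOB_NAME JOB_QUEUE JOB_DEFINITION
batch_submit() {
    "$AWS_CLI" batch submit-job \
        --job-name "$1" \
        --job-queue "$2" \
        --job-definition "$3"
}
```

eg `batch_submit blast-run-1 my-queue my-jobdef`; Batch then scales EC2 resources up to run it, within the min/desired/max vCPU limits of the compute environment.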

    Virtualization Tech

  • paravirtual instances (PV) - historically the standard AMI virtualization. Xen is used as the hypervisor.
  • hardware assisted virtual instances (HVM) - used in larger machines to circumvent hypervisor restrictions. Increasingly used for all instances.




    Amazon RDS (Relational Database Service) offers a few databases. Notably Aurora, claimed to be MySQL compatible but with improved performance, a cache that survives db restarts, etc. Even for Aurora, the DB software still needs to be set up by an admin... and so DB performance is limited by the node instance that is running the DB.

    DynamoDB is a NoSQL offering. Fully managed, so just create tables and access data using API. No need to maintain the DB itself, the DB is in some cloud, backed by distributed system. Advertised as single digit ms latency at any scale.

    Other examples of NoSQL DBs include: Hadoop, MongoDB.
    BigTable-based, rather than schema-less: Cassandra, HBase.

    Reference, see also...

    Copyright info about this work

    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. Pocket Sys Admin Survival Guide: for content that I wrote, (CC) some rights reserved. 2005,2012 Tin Ho [ tin6150 (at) ]
    Some contents are "cached" here for easy reference. Sources include man pages, vendor documents, online references, discussion groups, etc. Copyright of those are obviously those of the vendor and original authors. I am merely caching them here for quick reference and avoid broken URL problems.
