Big Data


HDFS usage
Ref: YDN Hadoop.  That guide was likely written before the "hdfs" command split off from "hadoop"; older docs use "hadoop fs" where "hdfs dfs" appears below.

hdfs dfs -ls			# list user's HDFS BASE dir (/user/$USER)
hdfs dfs -ls /		# list files from root of HDFS
hdfs dfs -mkdir /user/bofh 	# make a user's home dir; special privilege required.

hdfs dfs -put   foo  bar	# copy file/dir "foo" from unix to hdfs and name it "bar"
				# -put distributes the file across dataNodes per the replication factor.
				# -put errors if destination file "bar" already exists.
				# -put copies recursively if "foo" is a directory.
hdfs dfs -put   foo     	# copy file foo from unix.  Destination presumably defaults to the HDFS BASE dir when not specified.   ??  or not allowed ??

hdfs dfs -get bar baz		# retrieve file/dir "bar" from HDFS to unix, saving it as "baz".
				# only -put and -get exchange files between hdfs and unix;
				# all other commands manipulate files within hdfs.
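A typical round trip might look like this (a sketch, assuming a running cluster and a writable HDFS home dir; "foo", "bar", "baz" are placeholder names):

```shell
echo hello > foo                 # local unix file
hdfs dfs -put foo bar            # unix foo -> hdfs bar (lands in /user/$USER)
hdfs dfs -cat bar                # print the content back out of hdfs
hdfs dfs -get bar baz            # hdfs bar -> unix baz
diff foo baz                     # no output: round trip preserved the file
hdfs dfs -rm bar                 # clean up inside hdfs
```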

hdfs dfs -setrep 3 path		# set replication level (here 3) for path; -R for recursive
hdfs dfs -help CMD		# get help on command

hdfs dfs -cat bar		# like cat inside HDFS
hdfs dfs -lsr 		# recursive ls inside hdfs (newer versions: -ls -R)
hdfs dfs -du path		# disk usage per file/dir under path
hdfs dfs -dus			# du -s, ie display summary total (newer versions: -du -s)
hdfs dfs -mv src dest		# move WITHIN hdfs
hdfs dfs -cp src dest		# copy WITHIN hdfs
hdfs dfs -rm path		# rm   WITHIN hdfs.  use -rmr for rm -r
hdfs dfs -touchz path		# z for zero
hdfs dfs -test -e|z|d  path	# test if path Exists, is Zero length, or is a Directory; result in exit code
hdfs dfs -stat FORMAT  path	# print stats, eg %n=name %r=replication %o=block size %y=mtime
hdfs dfs -tail -f      bar   	# tail [-f] bar (file inside HDFS)
hdfs dfs -chmod -R 750 path
hdfs dfs -chown -R OWNER path	# chown inside hdfs (OWNER may be user or user:group)
hdfs dfs -chgrp -R GRP   path
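The -test exit codes make these usable in scripts. A sketch (paths are placeholders):

```shell
# remove the file only if it exists; -test -e sets exit code 0 when path exists
if hdfs dfs -test -e /user/bofh/bar ; then
    hdfs dfs -rm /user/bofh/bar
fi
hdfs dfs -stat "%n %r %o" /user/bofh/foo   # name, replication factor, block size
```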

hadoop distcp -help		# read up on distributed cp; it starts a MapReduce job to parallelize large copies

hdfs dfsadmin -report 
hdfs dfsadmin -help

hdfs fsck PATH OPTIONS		# check health of hdfs
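Typical health-check invocations (a sketch; output is cluster-dependent):

```shell
hdfs fsck / -files -blocks -locations   # per-file block report from the root
hdfs fsck / | grep -i corrupt           # quick grep for corrupt blocks
hdfs dfsadmin -report | head -20        # capacity and live dataNode summary
```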

Apache Hadoop


Apache Hive

Apache Spark

spark troubleshooting
http://n0156.lr3:8080/  - spark master web UI (scheduler, monitors workers)
http://n0161.lr3:8081/  - worker process web UI

http://n0093.lr3:4040/jobs/	- overview of job progress, one port per running job
spark://n0093.lr3:7077/		- spark protocol URL for spark-submit to send a job to the master
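Submitting against the master URL above might look like this (class and jar names are hypothetical placeholders):

```shell
spark-submit \
  --master spark://n0093.lr3:7077 \
  --class org.example.MyJob \
  myjob.jar input.txt
# while it runs, progress shows up at http://n0093.lr3:4040/jobs/
```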

Apache Kafka

Apache Storm


CouchDB, PouchDB, memcached, Couchbase



SciDB

    su - scidb
    initall  mycluster
    startall mycluster
    status   mycluster
    iquery -aq "list('arrays')" 	# list avail arrays.  [] means empty list
    iquery -q 'QUERY'			# run an AQL query; -aq runs AFL
    iquery -q 'create array X < x: uint64 > [ i=1:10001,1000,0, j=1:10001,1000,0]' 	# creates a test array
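A quick round trip, assuming the cluster above is running ("Y" is a placeholder array name):

```shell
iquery -aq "create array Y <x:uint64>[i=1:100,100,0]"   # AFL: create a small array
iquery -aq "list('arrays')"                             # Y now appears in the list
iquery -aq "remove(Y)"                                  # drop it again
```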


Apache Aurora


Parallel Environment

Apache Mesos

AirBnB Chronos (Mesos Framework)


cfncluster is a framework that deploys and maintains HPC clusters on AWS. It is reasonably agnostic to what the cluster is used for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
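A minimal session might look like this (a sketch, assuming cfncluster is installed and AWS credentials are configured; "mycluster" is a placeholder name):

```shell
cfncluster configure           # interactive: region, key pair, VPC/subnet
cfncluster create mycluster    # builds the CloudFormation stack
cfncluster status mycluster    # poll stack/cluster state
cfncluster delete mycluster    # tears the stack back down
```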

MIT StarCluster

Also a way to deploy and maintain HPC clusters in AWS, but cfncluster seems to be where the action is now. See aws.html for a sample POC setup session.


RDD, DataFrame

Apache Parquet




Copyright info about this work

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. Pocket Sys Admin Survival Guide: for content that I wrote, (CC) some rights reserved. 2005, 2012 Tin Ho [ tin6150 (at) ]
Some contents are "cached" here for easy reference. Sources include man pages, vendor documents, online references, discussion groups, etc. Copyright of those are obviously those of the vendor and original authors. I am merely caching them here for quick reference and avoid broken URL problems.
