Python

xkcd comic about getting python high

Python 101

Getting error when running pip search, like this?
pip search yaml
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
It is a real tragedy: pip search was disabled because of abuse, so the only option now is the web-based search at pypi.org. For more info and alternatives, see dev2qa.
Note that pip install -r requirements.txt would still work if the package is listed with the correct name(s).


  1. Python is a programming language where whitespace matters! As a result, file portability between Windows and Unix can become a problem, because the two platforms use different line endings (CR/LF vs LF).
  2. Python 3 is a new version of the language that supersedes Python 2. It adds new constructs and changes some existing ones, so a program written for Python 3 won't necessarily run in Python 2.x (and vice versa).
  3. ; is largely unneeded, since a statement ends at the end of the line. ; can be used to separate statements when writing a one-liner in the interactive shell.
  4. Anaconda: a distribution of Python, with managed core modules and some 195 libraries. It is especially useful for getting Python on Windows and Mac.
  5. Miniconda is a lightweight version of Anaconda. It doesn't come with Anaconda's huge library, and lets the user pull in only what is needed.
  6. PyPI - Python Package Index, intended to be a comprehensive catalog of all open source Python packages.
  7. pip - the package installer for Python (the name is usually read as "pip installs packages"). As of 2015, mostly just use pip.
  8. IPython: interactive shell for Python (and other languages now).
    pip install ipython
    pip install ipython[notebook]
    ref: http://ipython.org/install.html
  9. Jupyter/IPython notebook: this allows writing text interspersed with Python code. Good for testing ideas, data crunching and visualization type of projects. Anaconda comes with this, and typically runs the server at http://localhost:8888
    Ref: nbviewer
    And this OUseful blog describes 7 ways of running IPython notebooks. I particularly liked the CoLaboratory implementation on Google Drive. Authorea's support for IPython notebooks (as part of its web authoring platform) on the cloud was pretty neat too.
  10. No pointers?? see below
  11. Python2 FAQ is a good read once beyond the syntactic sugar and in need of more internals for a real programming project.

Tools from Python Libraries

Python idiosyncracies

If running a .py script gives an error like " : File not found", check that the Python script does not have DOS ^M (CR) line endings. The ^M at the end of the #! line becomes part of the interpreter name, which is why it can't be found. If needed, run cat old_script.py | dos2unix > new_script.py and run the new script. It is a weird error, and I thought most programs could handle ^M these days...

#!/usr/bin/env python
Many Python scripts start with that line. It effectively searches the user's PATH environment variable for python and runs whichever one is found first as the interpreter. Hard-coding #!/home/username/bin/python may not always work, as PYTHONLIB won't be set up (unless done in the code).

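Python source is compiled to byte-code, which the interpreter then executes.
Typically a .pyc file is produced the first time a module is imported (cached under __pycache__ in Python 3), so the first run incurs a small compilation delay. This is byte-code caching, not JIT compilation.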

Python 2 vs Python 3

Things to watch out for to write code that is more portable between python2 and python3

Avoid has_key() in Python 2,
i.e. avoid dict.has_key(k).
Instead use k in dict, which works in both Python 2 and 3 (k in dict.iterkeys() also works, but only in Python 2).

use print(x) rather than print x, the latter does not work in python 3.

float division: e.g. 27/3 gives 9.0 in Python 3, where / is true division and always returns a float. In Python 2, / on two ints is integer division (27/3 gives 9, and 7/2 gives 3) unless you use:
from __future__ import division
or
27/float(3) 
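A small sketch of the Python 3 behaviour described above; the numbers are picked only for illustration. The // operator gives floor division in both Python 2 and 3, so it is the portable way to ask for integer division.

```python
# Division behaviour in Python 3 (values chosen for illustration)
true_div = 7 / 2        # true division, always returns a float
floor_div = 7 // 2      # floor division, returns an int for int operands
exact = 27 / 3          # still a float, even when the result is whole

print(true_div, floor_div, exact)
```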


Python 3.6+ can use _ as a thousands separator in numeric literals (instead of a comma). The groups don't have to be 3 digits; the underscores are only for human reading and are stripped by the interpreter. Also note that it is *NOT* a "decimal point" like cents and dollars when only two digits are in a group. (Why do Spanish and some other languages reverse the roles of comma and period?) eg:
>>> print( 5_000_111_000_222_021 - 4_20010 )
5000110999802011
>>> print( 5_4321 + 100_00 )
64321
>>> 

Distributions of Python

  1. python.org. avail for Windows, Linux, ...
  2. anaconda, by continuum.io. Win, OS X, Linux. Free for commercial use too. Include 300 popular packages. Additional packages can be installed via pip (eg awscli).
  3. ActiveState (mostly for windows? personally try to avoid, even though nothing wrong with it really, just kinda non standard).

Environment

  • PYTHONPATH - module search path
  • sys.path - despite the name, this is not the OS PATH. It is the list of directories Python searches for modules (which are essentially Python programs), initialized from the script's directory, PYTHONPATH, and installation defaults. See https://docs.python.org/2/tutorial/modules.html#the-module-search-path
  • PYTHONSTARTUP - interactive startup file (commands listed here would be run as if they were typed in interactive shell).
    # one good way to use the environment's PATH, but if not set, can at least have some default.
    # https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
    import os
    import sys
    
    if 'SPARK_HOME' not in os.environ:
    	os.environ['SPARK_HOME'] = '/srv/spark'
    
    SPARK_HOME = os.environ['SPARK_HOME']
    sys.path.insert(1, os.path.join( SPARK_HOME, "python", "build"))
    
    
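    if 'PYTHONPATH' not in os.environ:
            os.environ['PYTHONPATH'] = "/home/system_web/local_python_2.7.9/lib/python2.7/site-packages/"
    PYTHONPATH = os.environ['PYTHONPATH']
    # os.path.join() discards everything before an absolute component,
    # so insert each directory separately instead of joining them.
    sys.path.insert(1, "/opt/python/2.7.9/lib/python2.7/site-packages/")
    sys.path.insert(1, "/home/system_web/local_python_2.7.9/lib/python2.7/site-packages/")
    for p in reversed( PYTHONPATH.split( os.pathsep ) ) :
            sys.path.insert(1, p)
    ## insert() does not check for duplicates,
    ## and it does not complain if the directory doesn't exist (such entries are just skipped at import time).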
    
    
    # run linux os command from inside python:
    import os
    os.system( 'echo $PYTHONPATH' )		# runs the command in a subshell; output goes to the terminal and only the exit status is returned
    
    os.system( 'env | grep PYTHON' )
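When the output of the command is needed inside the program, subprocess is probably the better tool than os.system. A minimal sketch, assuming Python 3.7+ (for capture_output); DEMO_VAR is a made-up variable name, and the child is another Python process rather than a shell command so the example does not depend on the platform.

```python
import os
import subprocess
import sys

# Child processes inherit os.environ, so a value set here is visible to the child.
os.environ["DEMO_VAR"] = "hello"   # DEMO_VAR is a made-up name for this sketch

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_VAR'])"],
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on non-zero exit
)
child_output = result.stdout.strip()
print(child_output)
```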
    
    

    Setting up the program on NixOS

    System-wide, edit /etc/nixos/configuration.nix
    environment.systemPackages = with pkgs; [
        python27Packages.ipython 			# does NOT provide python
        python34Packages.ipython
        python27Full 				# ipython does NOT depends on this package
    ]
    
    As user:
    nix-env -i python-2.6.9
    

    Setting up the program on Windows

    http://www.howtogeek.com/197947/how-to-install-python-on-windows/ has a step-by-step install, including how to setup environment variables.
    https://www.continuum.io/downloads is the download page for Anaconda, including versions for Windows (and OS X, Linux).

    conda

    Need to have Anaconda or Miniconda installed, which typically sets up some environment in .bashrc. Then can do:
    conda info --envs	# my installed conda env available for activation
    
    
    conda update -n base -c defaults conda		# but fails in centralized software NFS mounted env ?
    
    # conda create --name abricate
    conda create --name epi
    conda activate epi
    conda install -c bioconda abricate
    conda remove  abricate
    
    
    # tbd...
    conda install -c conda-forge -c bioconda -c defaults mlstn
    
    
    eg of adding channel to get more sw
     conda config --add channels defaults
     conda config --add channels conda-forge
     conda config --add channels bioconda
    

    PIP

    PyPI = Python Package Index - equivalent of Perl's CPAN.
    Python libraries from PyPI are installed using "pip".
    sudo yum install python-pip
        -or- python setup.py install 
    pip install easybuild
    
    pip list
    pip show      easybuild
    pip uninstall enum
    pip   install enum34
    
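    Libraries are installed to /usr/lib/pythonN.N/site-packages/
    List installed modules/libraries/packages
    pip list
    # the programmatic way below only works with old pip (< 10); later versions moved get_installed_distributions() into pip's private internals
    import pip
    sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()])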
    
    Ref: http://stackoverflow.com/questions/739993/how-can-i-get-a-list-of-locally-installed-python-modules
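On newer Pythons the same list can be produced without touching pip at all. A sketch assuming Python 3.8+, where importlib.metadata is in the standard library; the output format just mimics pip freeze.

```python
from importlib import metadata  # standard library since Python 3.8

installed = sorted(
    "%s==%s" % (dist.metadata["Name"], dist.version)
    for dist in metadata.distributions()
    if dist.metadata["Name"]  # skip entries with broken metadata
)
for line in installed:
    print(line)
```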

    Python Environment/Virtual Environment

    (Contrast this with Continuum's Conda, which do this and some more)
    python3.6 -m venv ./venv 		# create a virtual env (python 3.6)
    source ./venv/bin/activate 
    pip install -r requirements.txt		# list all pip install packages in a requirement file
    
    ## below are old school, obsolete way to create/invoke virtual env
    pyvenv ~/local_python_3.4		# create a virtual env (python 3.4)
    					# in python 2.7, use virtualenv ~/local_python_2.7 instead
    					# create vir env once for each version of python being used
    source ~/local_python_3.4/bin/activate	# activate virt env (change path accordingly for diff version of python)
    pip install  scipy			# install module into virtual env using pip  (eg for installing scipy)
    easy_install scipy			# install module into virtual env using easy_install (alternate of pip, don't need to do both)
    
    
    example requirements.txt
    docker
    #re
    regex
    shutils
    argparse # version 1.1
    #logging # version 0.5.1.2  # neither of these logging worked for pip 21.3.1 (python3.6, in 2022-007-10)
    #logging==0.4.9.6
    

    Python Module

    Python libraries are provided as modules, which can be imported. Python searches for modules in the directories listed in sys.path, which includes those from the environment variable PYTHONPATH.
    When writing Python modules, it is best if the module does not output anything. Otherwise, when a consumer does import module_foo, it would execute the code in that module and output things that may not be desirable.
    https://docs.python.org/2/tutorial/modules.html#the-module-search-path has a good overview of modules:
  • __name__ contains the name of the module
  • if __name__ == "__main__" : # module is being executed directly, can place desired execution statements in here
  • when importing a module, all statements and definitions are executed when the import is run (once). This is why any print statement in the main body of the module will print at import time
  • dir( __builtin__ ) lists all the names defined as Python built-ins (the module is called builtins in Python 3). Can use dir() with any module.
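A minimal sketch of a module that stays silent on import. The file name module_foo.py and the function names are hypothetical, used only for illustration.

```python
# module_foo.py (hypothetical file name used for this sketch)

def greet(name):
    """Return a greeting instead of printing, so importing stays silent."""
    return "hello, %s" % name

def main():
    print(greet("world"))

if __name__ == "__main__":
    # Only runs for `python module_foo.py`, not for `import module_foo`.
    main()
```

A consumer can then do import module_foo and call module_foo.greet() without any unexpected output.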

    Packages - python dotted modules name. ie, provide hierarchy. https://docs.python.org/2/tutorial/modules.html#packages

    Python Language

    
    0 is the index of the first element (like perl.  unlike awk, which is 1).
    
    [ ] = list.  ordered items.  think of array in most languages.
          in python, it can also behave like a stack.  ie think push/pop.  
          [].append() add items (push).
          in reality it is a dynamic array (not a linked-list).  items can still be removed from the middle of the list, at O(n) cost.
          	myList = [ 'a', 'b', 'c', 'c' ]
    	myList[1] 	# evals to 'b'		# array syntax, 0-based index
    	myList[-2]				# -ve counts from the end, returns 2nd from last item
    	myList[1:3]				# slice
    	myList.append( 'e' )			# add item to list
    	del myList[2]				# shrink list
    	L2 = [ 'ab', ['cde', 'fgh' ] ]		# nested list
    	len(L2)					# length of list, in this eg returns 2
    	L2[i][j]				# 2D array index for nested array.
    	for x in myList :			# x takes each item in turn
    
    ( ) = tuple, contains ordered elements.  *immutable*  
          Strings are not tuples, but they are also immutable sequences.
    	point = (x, y)
    	t1 = ( 2, 2 )
    	t2 =   2, 4, 2, 8  			# () syntax is optional when there is no ambiguity
    	t3 = ('xy', ('abc', 'def', 'ghi') )	# nested tuples
    	t3[i][j]				# 2D array syntax works on nested tuples as well
    	t2[1:3]					# slice syntax works on tuples too (but not on sets, which are unordered)
    	for x in t2 :	 			# x takes each item in turn
    	items = set( myList )			# set: unordered, no duplicates, so dup 'c' will be removed
    
    { } = dictionary/hash.  key -> value mapping.   eg ENV['HOME']  = '/nfshome/tin'      # Perl: $ENV{HOME} = "/nfshome/tin"
    	dictionaries are mappings, not sequences.
    	codon['ATG'] = 'methionine'
    	codon.keys()				# works in python 2 and 3; iterkeys() is python2 only
    	codon.values()
    	len( codon )
    	resultTable['species']['homo sapiens'] = 1
    	resultTable = { 'species' : { 'homo sapiens': 1} }	# nesting, what is really happening for 2D hash above
    	for k in codon : 			# same as for k in codon.keys()
    		print( codon[k] )		# iteration for hash is automatically on hash key 
    
    Additional container datatypes, see 
    python3 collections
    
    namedtuple()  
    ChainMap
    OrderedDict
    etc
    
    
    Things that evaluate to False: 
    False    	# built-in boolean. the all-caps FALSE is not defined
    None
    0
    0.0		# float
    0j		# complex
    bool( 0j )	# type-cast
    ""		# empty string
    []		# empty list
    {}		# empty dictionary
    ()		# empty tuple
    set()		# empty set
    
    when __len__() is 0		# eg user-defined function return 0 length list
    when __bool__() is False	# eg user-defined function returning false as boolean value
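A quick sketch to check the truth-testing rules above. Basket is a hypothetical class made up for this example; it only defines __len__, which is what bool() falls back to when there is no __bool__.

```python
class Basket:
    """Hypothetical container used only to illustrate truth testing."""
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)   # bool(obj) falls back to len(obj) != 0

falsy_values = [False, None, 0, 0.0, 0j, "", [], {}, (), set(), Basket([])]
all_falsy = not any(bool(v) for v in falsy_values)

print(all_falsy)
print(bool(Basket(["apple"])))
```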
    
    
    constants
    https://docs.python.org/2/library/constants.html
    
    True		# built-in bool type, the all-caps TRUE is not defined
    False
    None		# types.NoneType  a function that should return an object but just issue "return" will get None. kinda like NULL
    NotImplemented
    Ellipsis	# used in slicing syntax
    __debug__ 	# true if not started as python -O 
    
    
    * and **
    http://stackoverflow.com/questions/3394835/args-and-kwargs
    *  eg *args 	# tuple of positional args
    ** eg **kwargs	# dictionary (key, val) of named args
    
    def fn1( *args ) :
    	# *args  is for variable number of arguments
    	for (i, arg) in enumerate( args ) :
    		print( i, arg )
    def fn2( **kwargs ) :
    	# **kwargs is for variable number of named arguments
    	for (name,val) in kwargs.items() :
    		print( name, val )
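A runnable sketch of how the arguments get collected, and how the same * and ** operators unpack a list or dict at the call site. The function name describe and the sample values are made up for illustration.

```python
def describe(*args, **kwargs):
    """Return what was received, to show how arguments are collected."""
    return args, kwargs

positional, named = describe(1, 2, color="red")

# The same operators unpack a sequence or dict at the call site.
coords = [3, 4]
options = {"color": "blue"}
positional2, named2 = describe(*coords, **options)

print(positional, named, positional2, named2)
```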
    
    
    Strings
    
    % is the old magic for string formatting in python.  the newer way is {} with str.format()
    print( "most c-style string %s works" % stringvar )
    print( "num %d, fixed point %8.2f, exponent %12e"  % ( 123, 6.1234, 0.0000123 ) )
    ref https://docs.python.org/2/library/string.html#format-specification-mini-language
    
    print( "Total rows processed: {:,}".format(rows )  )  # {:,} provides thousand separators
    ref https://pyformat.info/
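The formatting styles side by side, as a small sketch. The values are made up, and the f-string line assumes Python 3.6+.

```python
rows = 1234567          # sample value, made up for this sketch
ratio = 6.1234

old_style = "rows %d, ratio %8.2f" % (rows, ratio)
new_style = "rows {:,}, ratio {:.2f}".format(rows, ratio)
f_string = f"rows {rows:,}, ratio {ratio:.2f}"   # Python 3.6+

print(old_style)
print(new_style)
print(f_string)
```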
    
    strings, like tuple, are immutable.
    
    if 'abc' in StringVariable : 	# searching string to see if it contain a substring:
    if 'abc' == StringVariable :	# the two strings are the same
    if  Foo  is  Bar :		# see if two names refer to the same object (identity, not equality; use == to compare string contents)
    
    "" vs ''	# no difference in python, both make the same string.  It is NOT like shell where variables are not evaluated inside ''
    [] vs ()	# really depends if function is expecting a list or a tuple
    
    Strings examples:
    
    description = """Topspin NMR software (data processing option only)"""
    
    """this can be multiline string
    and can serve as 
    comment out code
    etc"""
    
    '''here is another multiline string
    that includes line break'''
    
    Note that while a multi-line string can be treated as a multi-line comment, the indentation matters!
    The opening quotes must start at the indent level of the surrounding code.  
    If it starts flush at the left margin inside an indented block, it can break the code.
    
    source = [ 'topspin.%s.tgz' % version ]
    
    '%s %s' % (path,version)
    
    install_cmd = [ 'tar xfzp %s/%s' % (source_urls,sources) ]     ## file:/// screws up tar
    
    sanity_check_paths = {
        'files': ['bin/%s' % x for x in ['moe', 'moebatch', 'chemcompd', 'rism3d', 'sdwash']],
        'dirs': [],
    }
    
    
    postinstallcmds = [ 'pwd', 'ls -la', 'touch TesT.txt', 'mkdir %(installdir)s/prog/curdir/wongja7', 'chgrp emv-structchem prog/curdir', 'chmod g+w prog/curdir' ]
    
    
    
    modextravars = {
        #'TOPSPIN_HOME': '/usr/prog/topspin/3.5pl2',
        'TOPSPIN_HOME': '%(installdir)s',
    }
    
    
    toolchain = {'name': 'dummy', 'version': 'dummy'}
    
    

    Globals, Module's var

    
    Best way to make module-level variables?
    This may be one way, which is what I used in taxo reporter.
    
    Define the variable at the top of the module, with a comment that others who import it may change it.
    Similar in spirit to __debug__ and __builtin__
    
    import mydb
    mydb.foo = bar
    
    this way, bar could be set from cli args (eg parsed by argparse), so many file paths can be set at run time, yet have some defaults defined as the module's global var.
    
    
    cross-module var discussion suggested:
    __builtins__.foo = bar
    which may seem to do enough, but new versions of python may run into conflicts.
    Note __builtin__ is the global counterpart that needs to be imported before use.  Python 3 renamed it to builtins.
    
    
     
    Python2 FAQ 
    recommends the creation of a global module for the project, calling it config.py or cfg.py, 
    putting all variables there,
    and having all consumers refer to it.
    For a large project with multiple cross-module references, this avoids a spaghetti of "globals" in each module .py file.
    OOP-style mutators/constructors to set them aren't necessary.  Just modify the var; python doesn't offer protection, just conventions.
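A rough sketch of the shared config module idea. In a real project config.py would be an actual file; here it is built in memory with types.ModuleType just so the example is self-contained. The attribute names (db_path, verbose) and the paths are hypothetical.

```python
import sys
import types

# Stand-in for a real config.py file; built in memory so the sketch is self-contained.
config = types.ModuleType("config")
config.db_path = "/tmp/default.db"   # hypothetical default
config.verbose = False
sys.modules["config"] = config

def consumer():
    """Any module can `import config` and read the shared values."""
    import config as cfg
    return cfg.db_path

def apply_cli_override(path):
    """E.g. called after argparse has parsed the command line."""
    import config as cfg
    cfg.db_path = path

before = consumer()
apply_cli_override("/data/run42.db")
after = consumer()
print(before, after)
```

Because every importer gets the same module object, a change made in one place is seen everywhere, which is the whole point of the pattern.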
    
    

    Scoping rules

    LEGB Rule.
    Local
    Enclosing function locals
    Global   (module)
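    Built-in (python)
    
    Before assigning to a global var inside a fn, must first declare the var as global (reading it needs no declaration).
    
    Python3 added a nonlocal statement, which does the same for a variable in an enclosing function.
    Python Scoping rules discussion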
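A minimal sketch of global and nonlocal in action (Python 3 only, because of nonlocal). The names are made up for illustration.

```python
counter = 0                      # global (module) scope

def bump_global():
    global counter               # required because we assign to it
    counter += 1

def make_counter():
    count = 0                    # enclosing-function scope
    def bump():
        nonlocal count           # Python 3 only
        count += 1
        return count
    return bump

bump_global()
bump = make_counter()
first, second = bump(), bump()
print(counter, first, second)
```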
    
    
    

    Snippets in stand alone program

    # tab nanny (python2 only; python3 always rejects inconsistent mixing of tabs and spaces)
    python -t   # display warnings
    python -tt  # display errors
    
    # use SPACES in python!!  
    # avoid TAB, which python2 treats as advancing to the next multiple of 8 columns.
    # indentation is what delineates a block.  
    #
    # code indented 4 spaces is at a diff block level than code with 2 spaces !!
    # also note the use of :  after the condition, and after the else clause
    if ( A < B and C < D) :
        print( "and will be optimized, C < D is evaluated only if A < B is True" )
        print( "python &, | are  bitwise operator" )
        print( "this is still part of the if-condition" )
    elif ( P == Q ):
        print( "string and numeric equality is tested by ==" )
    elif ( P != Q ):
        print( "!= can be used to test whether two objects are different" )
    elif ( 1.5 < X < 4.8 or 178 < Y < 188 ):
        print( "range test can be carried out as condition evaluation" )
    else:
        print( "final else part" )
    print( "this line is beyond the end of the if/elif/else block" )
    # note there are no braces or endif command to delineate the block !!
    
    
    eg for-loop
    
    for x in list :
        cmd1
        cmd2
    cmd3
    
    use `continue` to jump to next iteration
    
    while( X < 10 ) :
        cmd1
    cmd2
    
    
    # logical operator 
    # just simple word, no all caps, no use of && ||   (editor will color these reserved word differently)
    and
    or
    not
    
    
    # string equality comparison using == 
    txt = "abc"
    if( txt == "abc" ):
        print( "match" )
    
    
    # import regular expression (regex) lib
    import re
    
    # this is closest to perl re search
    m = re.search(r"(\w+)(Jul)(\w+)", "foo_Jul_bar")
    
    if m : # ie execute only when a match is found
    	print( "YES match found" )
    	print( m.group(0) ) # "foo_Jul_bar", ie the whole regex match
    	print( m.group(1) ) # "foo_", perl's \1
    	print( m.group(2) ) # "Jul" , perl's \2
    	print( m.group(3) ) # "_bar", perl's \3
    else : 
    	print( "NO  match found" )
    
    #re.match(...) match only starting from the beginning
    
    
    # get command line argument
    import sys
    option1 = sys.argv[1]
    # argv[0] is the script name as it was invoked (not the path of the python interpreter, which is sys.executable)
    
    
    # example for enumeration and 2D hash 
    
    # enum functional style, need python3 
    # https://docs.python.org/3/library/enum.html#functional-api
    
    
    from enum import Enum
    
    RankSet = Enum( 'Rank',   'species genus family order superkingdom' )
    RankSet = Enum( 'Rank',  ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'no rank', 'NoLineageData']  )       
    
    def example2Dhash( giList ) :
        # 2D dictionary is really a hash nested inside another hash
        # simple usage can use a decent format,  
        # but initialization is pretty hairy. 
        # Under some circumstances may not need to init the 2D dictionary, 
        # but in this eg the 2D hash is evaluated before it is set
        # in the line "if lineage in resultTable2[rank]:"
        # therefore init is needed (or add more test conditions before the if-line).
        # may really want to create a class, and go with OOP for at least this data structure...
        # 2D hash ref: 
        # http://stackoverflow.com/questions/3817529/syntax-for-creating-a-dictionary-into-another-dictionary-in-python
        # http://www.python-course.eu/python3_dictionaries.php
    
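        resultTable2 = {"species": {}}
        resultTable2 = {"NoLineageData": {}}    # note: this rebinds the name, so the "species" entry above is discarded
    
        for nom in RankSet.__members__ :
                    # two alternative ways to seed; only one is needed, and the second update replaces the first
                    resultTable2.update( { nom: {"NoLineageData":0} } )     # seed both hash keys, may not need this complication for init sake
                    resultTable2.update( { nom: {} } )                      # seed only first hash key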
        # other example of init elements of the 2D hash:
        #resultTable2.update( { "NoLineageData": {"NoLineageData":0} } )
        #resultTable2 =       { "species": {"homo":0} }
        #resultTable2.update( { "species": {},         "genus":   {},          "family":  {} } )
        #resultTable2.update( { "species": {"HOmo":0}, "genus":   {"GEnus":0}, "family":  {} } )
    
        # if did not initialize the 2D hash above, assignment below would fail.
        for gi in giList:
              for rank_item in RankSet:
                    print( rank_item.name )      # .name ref https://docs.python.org/3/library/enum.html#programmatic-access-to-enumeration-members-and-their-attributes
                    rank = str(rank_item.name)    
                    lineage = getLineageByGi( gi, rank )
                    dbg( "%s \t %s \t %s " % (gi, rank, lineage) )
                    if lineage in resultTable2[rank]:               # python3 changed  has_key to "KEY in" python3 https://docs.python.org/3/library/2to3.html?highlight=has_key#2to3fixer-has_key
                        resultTable2[rank][lineage] += 1		
                    else:
                        resultTable2[rank][lineage]  = 1
                        ###resultTable2.update( { rank }: {[lineage]: 1} )	# don't really need this convoluted syntax!
              #for-end rank_set
        #for-end gi
        print( resultTable2["species"]["Aedes pseudoscutellaris reovirus"] )
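An alternative that sidesteps the init problem entirely: collections.defaultdict creates the missing inner dictionary and the missing counter on first use. This is only a sketch, not what the taxo reporter code does; the function name count_lineages and the sample records are made up.

```python
from collections import defaultdict

def count_lineages(records):
    """Count (rank, lineage) pairs without pre-seeding the nested dict.

    `records` is an iterable of (rank, lineage) tuples; the sample data
    below is made up for illustration.
    """
    table = defaultdict(lambda: defaultdict(int))
    for rank, lineage in records:
        table[rank][lineage] += 1     # missing keys are created on demand
    return table

sample = [("species", "HBV"), ("species", "HBV"), ("genus", "Orthohepadnavirus")]
result = count_lineages(sample)
print(result["species"]["HBV"])
```

One caveat: merely reading a missing key from a defaultdict also creates it, so use "key in table" when only testing for presence.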
    
    
    Ref for Enumeration :
  • tech blog cover .name, .value, Enum( n ), Enum['name']
  • python3 standard library autonumber eg
    
    # Auto numbering Enumeration in a class, so as to be able to define functions
    # It demo some construct, but a hash maybe simpler and less overhead
    # Enum was a python3 feature, thus
    # in python2, need to "pip install enum34"  (which is diff than enum module, out of fashion now)
    
    from enum import Enum
    
    class AutoNumber( Enum ) :
            def __new__( cls ) :
                    value = len( cls.__members__ ) + 1
                    obj = object.__new__( cls )
                    obj._value_ = value
                    return obj
    
    class RankSet( AutoNumber ) :
            species         = ()    # order in this list matter!!
            genus           = ()    # RankSet.__x__.name
            family          = ()    # can add other ranks in middle if desired
            superkingdom    = ()    # code expects sk to be highest
            #'no rank'       = ()   # can't do this, but new class-way of RankSet should not need this anyway
    
            @classmethod
            def getLowest( cls ) :
                    #(name, member) = cls.species
                    #return cls.species     # return RankSet.species  (what is needed programatically for getParent() etc
                    return cls( 1 )         # know that lowest rank in Enum class starts with 1
            				# http://www.tech-thoughts-blog.com/2013/09/first-look-at-python-enums-part-1.html
                    #return cls.species.name        # return species  # http://stackoverflow.com/questions/24487405/python-enum-getting-value-of-enum-on-string-conversion
                    # below will do the equivalent, but much slower
                    for (name, member) in cls.__members__.items() :
                            if member.value == 1 :
                                    return name
                    #return cls.__members__
            @classmethod
            def getHighest( cls ) :
                    return cls.superkingdom      # how to use value=max ??
            # getParent(species)
            @classmethod
            def getParent( cls, rank ) :
                    if( rank == cls['superkingdom']  ) :
                            return None         # return None, as no parent for sk
                    return cls( rank.value + 1)
                    # below will do the equivalent, but much slower
                    for (name, member) in cls.__members__.items() :
                            if member.value == rank.value + 1 :
                                    return member
            @classmethod
            def getChild( cls, rank ) :
                    if( rank.value == 1 ) :
                            return          # return None, as no child for species
                    return cls( rank.value - 1)
                    # below will do the equivalent, but much slower
                    for (name, member) in cls.__members__.items() :
                            if member.value == rank.value - 1 :
                                    return member
    
    
    # RankSet class end
    # RankSet is meant to be a static class, not to be instantiated.
    # support calls like these:
    #        RankSet.getLowest()                            # RankSet.species
    #        RankSet.getLowest().name                       # species
    #        RankSet.getParent( RankSet.getLowest() )       # RankSet.genus
    #        r = RankSet.getChild( RankSet.getLowest() )    # get None when "out of range"
    #        if r is None :					# r == None works, but may break when == gets overloaded
    #                print( "got None from RankSet fn call..." )
    #        RankSet(1).value  RankSet(3).name   RankSet['genus']  are valid expressions
     
    
    Snippet with example of namedtuple ::
    from collections import namedtuple
    
    GiNode = namedtuple( 'GiNode', ['Freq', 'Taxid'] )     # define the type once, outside the loop
    
    def eg_of_create_namedtuple( filename ) :
        giList = { }                
        with open( filename, 'r' ) as f :               # file is closed automatically
            for line in f:
                lineList = line.split( '|' )
                g = lineList[1]                         # python list index start at 0
                if g not in giList :
                        taxid = getTaxidByGi( g )       # helper defined elsewhere in the program
                        giList[g] = GiNode( Freq=1, Taxid=taxid)
                else :
                        #(freq, taxid) = giList[g]  # this works
                        gin = giList[g]		    # but this keeps to the spirit of namedtuple as an entity
                        giList[g] = GiNode( gin.Freq+1, gin.Taxid )
        return giList
    
    
    def egConsumer_of_namedtuple( giList ) :
            for g in giList :
                    print( "looking at gi:%s \t with freq: %s \t and taxid=%s" % (g, giList[g].Freq, giList[g].Taxid) )
                    if giList[g].Taxid not in resultTable4[currentRank] :
    			# the "key" to the namedtuple is available here even when it is not defined here
                            parentTaxid = getParentByTaxid( giList[g].Taxid )
                            rankName = getLineageByTaxid( giList[g].Taxid, currentRank )
                            node = TaxoNode( parentTaxid, rankName, giList[g].Freq, giList[g].Taxid )
                            resultTable4[lowestRank][giList[g].Taxid] = node
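A self-contained sketch of the namedtuple operations used above. The field values are made up. Since a namedtuple is immutable, "updating" a field means building a new tuple, and _replace() is a shorthand for that.

```python
from collections import namedtuple

GiNode = namedtuple("GiNode", ["Freq", "Taxid"])

node = GiNode(Freq=1, Taxid=9606)            # sample values for illustration
freq, taxid = node                           # unpacks like a plain tuple
bumped = node._replace(Freq=node.Freq + 1)   # returns a new tuple; node is unchanged

print(node, bumped)
```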
    
    
    
    # reading text file
    
    f = open( filename, 'r' )
    print( f.read() )		# print whole file
    f.seek( 0 )			# rewind before reading again line by line
    for line in f:
            print( line )
            lineList = line.split( ',' )
    f.close()
    
    
    # write to file
    
    outFH = open( outfile, 'w' )
    outFH.write( "typical write method\n" )
    print( "print redirect write method, need to add 'from __future__ import print_function' in python2 to work" , file = outFH )
    displayText = '{0: ^50}'.format( entry )		# center entry in a 50-char wide field
    print( "%5d \t %8.4f %% \t %s" % (intNum, floatNum, stringVar) , file = outFH )
    outFH.close()
    
    
    Sort
    The python Sort HowTo is a concise read on how to sort iterables by specifying which field to use as key. Its class-based examples define __repr__ so the sorted objects print readably.

    mylist.sort() sorts in-place, so it saves space and is slightly faster. It returns None. sorted(mylist) returns a new sorted list, so it is a tad slower, but the difference is said to be not too significant.
    By default, Python uses timsort, an optimized hybrid of merge sort and insertion sort. It is heavily optimized for already-sorted input, where it needs only N-1 comparisons, i.e. O(N). The worst case is O(N log N), close to the lg(N!) comparison lower bound.
    More info at http://stackoverflow.com/questions/1436962/python-sort-method-on-list-vs-builtin-sorted-function
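A short sketch of sorting by a chosen field, showing both sorted() and list.sort(). The (name, count) pairs are sample data made up for this example.

```python
from operator import itemgetter

# (name, count) pairs; sample data made up for this sketch
hits = [("HBV", 13), ("BK", 28), ("HIV", 14)]

by_count = sorted(hits, key=itemgetter(1), reverse=True)   # new list, highest count first
hits.sort(key=lambda pair: pair[0])                        # in place by name, returns None

print(by_count)
print(hits)
```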
    ...
    
    

    Functional Programming in python

    Python supports imperative (procedural), OOP, as well as functional style. Since none is dictated/required, a hybrid approach is possible. A few observations:
    1. Avoiding side effects (core of functional programming) may not always be possible, eg printing messages to screen, writing to file.
    2. Functional programming centers on being stateless. This is easier to achieve for a function that takes specific input and produces output deterministically, but the inside of the function may need to be stateful for more complex tasks. Well, sorting is complex: it can be done procedurally (bubble sort) but can also be done by divide and conquer without state, as in merge sort.
    3. Focusing on statelessness, functional programming is in this sense at the opposite end of the spectrum from OOP, which is about objects with methods that change internal state.
    The following slide from my colleague Wes provides the gist of FP in Python:
    
    lambda functions: create anonymous functions
        	addFive = lambda x: x + 5
        	addFive(8) 		# result: 13
    
    map()
        	map(func, sequence) 	# Applies func() to every element of sequence.
    
    filter()
    	filter(func, sequence) 	# Returns elements where func() returns True.
    
    reduce()
    	reduce(func, sequence) 	# Reduces a list to a single value.  python3: from functools import reduce
    	total = reduce(lambda x,y: (x+y), [1,2,3]) 	# result: 6 
    
    
    list comprehensions: syntactic sugar, clearer than map() or filter()
    	[x.upper() for x in seq]              vs.    map(lambda x: x.upper(), seq)
    	[x         for x in seq if x > 0]     vs. filter(lambda x: x > 0,     seq)
    
    
    collect() 	# Return a list of all elements (this one is a Spark RDD method, not a python built-in)
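The pieces above put together as a runnable Python 3 sketch. Note that map() and filter() return lazy iterators in Python 3, hence the list() calls. The sample numbers are made up.

```python
from functools import reduce   # reduce is not a built-in in Python 3

nums = [3, -1, 4, -1, 5]

doubled = list(map(lambda x: x * 2, nums))        # map() is lazy in Python 3
positive = list(filter(lambda x: x > 0, nums))
total = reduce(lambda x, y: x + y, nums)

# Equivalent list comprehensions
doubled_lc = [x * 2 for x in nums]
positive_lc = [x for x in nums if x > 0]

print(doubled, positive, total)
```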
    
    maybe useful books...
    
    http://www.amazon.com/Guide-Functional-Python-Comprehension-Constructs-ebook/dp/B00CUZDOSI/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1453132773&sr=8-8
    
    concise intro to functional programming.  likely using python as construct.
    about 45 pages.  maybe better than some web stuff?
    talks about lambda fn, map/reduce/filter, then go into recursion, comprehension, generators.
    
    the above is Section III of Treading on Python Volume 2: Intermediate Python
    
    (seems like I don't like either).
    
    
    http://www.amazon.com/Functional-Python-Programming-Steven-Lott/dp/1784396990/ref=sr_1_1?ie=UTF8&qid=1453132773&sr=8-1&keywords=functional+programming+python
    
    start out with procedural/functional hybrid.
    maybe easier to follow to get more functional code into programs.
    some 330 pages.  dive into many specific of iter(), where they are used, etc.
    worthwhile if start programming a lot in python.
    
    
    PS.  LISP is the early functional language.  code was pretty hard to read.  Erlang is more modern.  CouchDB is coded in Erlang.
    
    

    iterators

    Even if you don't want to go fully functional, understanding iterators goes a long way toward understanding many procedural constructs.
    
    eg.  for X in Y is really for X in iter(Y) 
    lists and dictionaries are iterable.
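    What the for loop does behind the scenes can be spelled out by hand. This is a minimal sketch; the list months and the variable seen are made up for illustration.

```python
months = ["jan", "feb", "mar"]

# Equivalent of:  for item in months: seen.append(item)
it = iter(months)          # ask the iterable for an iterator
seen = []
while True:
    try:
        item = next(it)    # fetch the next element
    except StopIteration:  # the iterator signals that it is exhausted
        break
    seen.append(item)

print(seen)
```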
    
    dictionaries especially!
    m = { 'jan': 1, 'feb': 2, 'mar': 3 }	# key: value pairs; with commas only, python builds a set
    for key in m: 		# same as for key in iter(m)   
        print( key, m[key] )   
    
    			# side note: python2 allowed 
    			# if m.has_key( k ) 
    			# the has_key() is no longer avail in python3
    			# so use the syntax of
    			# if k in m  
    
    iter( m )		# create iterator from dictionary.  see  https://docs.python.org/3/library/stdtypes.html#typesmapping 
    m.keys()		# python2 use m.iterkeys()
    m.values()		# python2 use m.itervalues()
    m.items()		# python2 use m.iteritems()
    
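    in 2D (nested) dictionaries:
    table = { 'species': { 'HBV': 13, 'BK': 28, 'HIV': 14 }, 
              'genus':   { 'H':   27, 'B':  28 },
              'family':  { 'tot': 55 }
            }
    
    ranks = table.keys()				# python2: table.iterkeys()
    
    familySum = sum( table['genus'].values() )	# 55; python2: table['genus'].itervalues()
    
    table.items()		# (rank, inner dictionary) pairs; python2: table.iteritems()
    iter( table )		# dictionaries have no .iter() method, use the iter() built-in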
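    One way to walk a nested dictionary in Python 3, as a minimal sketch using the same table data (the name rank_totals is made up for illustration):

```python
table = {
    "species": {"HBV": 13, "BK": 28, "HIV": 14},
    "genus": {"H": 27, "B": 28},
    "family": {"tot": 55},
}

# items() yields (key, value) pairs; here each value is itself a dictionary
for rank, counts in table.items():
    for name, count in counts.items():
        print(rank, name, count)

# Sum the inner values for every rank
rank_totals = {rank: sum(counts.values()) for rank, counts in table.items()}
print(rank_totals)
```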
    
    
    there are the 
    iter()
    tuple() 
    that help understand list/tuple generations/conversion. 
    
    
    Ref: Python 2 - Functional - iterators

    Generator and Comprehension

    
    This is probably key to wrap head around functional programming.
    
    () is for generator expression ... return iterator (lazy, one pass only)
    [] is for list comprehension   ... return list
    the for clause inside the parentheses or brackets is what distinguishes these from a plain tuple or list literal
    
    ( obj.count for obj in list_all_objects() )
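    A minimal sketch of the difference, with made-up names: the list comprehension builds everything at once, while the generator expression produces values on demand and can only be consumed once.

```python
squares_list = [n * n for n in range(5)]   # list comprehension: built immediately
squares_gen = (n * n for n in range(5))    # generator expression: nothing computed yet

first = next(squares_gen)       # computes only the first value
rest = list(squares_gen)        # consumes whatever is left
leftover = list(squares_gen)    # already exhausted, so this is empty

print(squares_list, first, rest, leftover)
```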
    
    

    Ref: Python 2 - Functional - generator...

    Object Oriented Programing in Python

    OOP, especially a data structure with methods that modify its state, is close to the opposite of functional programming. GUIs are probably a natural fit for OOP, while business logic is probably better served by FP, and a procedural approach is often good enough. Python modules provide encapsulation and separation, yielding some benefits of OOP without the altered logic imposed by classes. See http://docs.python-guide.org/en/latest/writing/structure/
    class myClass(parentClass1, parentClass2) : 
    	classwideVar = "this is shared by all object/instances of this class. "		# be careful with this, not like Java!
    	def __init__( self ) :
    		self.instanceVar = "this is instance specific"	# without self. this would only be a local variable
    	def fn( self ) : 
    		print( "hello world" )
    		# super() refers to the parent class
    
    parentClass can be blank if not inheriting anything. It is defined in the class clause; creating an object need not state anything about it. Standard data types can be used for parentClass, eg object, Enum. Multiple parent classes can be listed, comma delimited.
    x = myClass()
    xf = x.fn	# this is valid!  a method name is an attribute of the class... this defines an alias to the bound method...  
    xf()		# actually calls x.fn()
    
    myClass.fn(x) 	# this is what is happening when calling x.fn(), which is why first param of fn is called self.
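    A self-contained sketch of the two points above: a bound method can be stored and called later, and a class-wide variable is shared by every instance. The class Counter and its attribute names are made up for illustration.

```python
class Counter:
    shared = []                  # class attribute: one list shared by all instances

    def __init__(self, name):
        self.name = name         # instance attribute: separate for each object

    def describe(self):
        return "counter " + self.name


a = Counter("a")
b = Counter("b")

bound = a.describe               # alias to the method, already bound to a
via_class = Counter.describe(a)  # what a.describe() does under the hood

a.shared.append(1)               # mutates the single class-level list

print(bound(), via_class, b.shared)
```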
    
    Data attributes override method attributes with the same name !!
    Use some standard to avoid bugs, eg verbs for methods, nouns for data.
    well, Java says data should be private and accessed via methods provided by the class...
    
    class Pizza(object):
        shape = "round"					# ie, all pizzas will have the same shape.  
        favoriteIngredient = "pepperoni"
        def __init__(self, ingredients):
            self.ingredients = ingredients		# instance variables come to life when they are first assigned
    
        @classmethod					# define a class-level method: it receives the class (cls), not an instance
        def getFavoriteIngredient( cls ):
            return cls.favoriteIngredient
    
    
    p = Pizza( "pineapple" )
    print( p.ingredients )					# attributes in python are "public" in the C++/Java nomenclature
    							# nothing in python enforces data hiding, it is all done by convention!!
    
    print( Pizza.shape ) 
    
    # class is object too in python!
    # Exceptions are ... ??
    
    @staticmethod
    
    @classmethod
    
    @abstractmethod
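    A minimal sketch showing the three decorators side by side. The classes Shape and Square are made up for illustration; abstractmethod comes from the standard abc module.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    sides = 0

    @abstractmethod
    def area(self):
        """Subclasses must implement this before they can be instantiated."""

    @classmethod
    def describe(cls):
        # Receives the class itself, so subclasses report their own values
        return f"{cls.__name__} with {cls.sides} sides"

    @staticmethod
    def unit():
        # Receives neither the instance nor the class; just a namespaced function
        return "cm^2"


class Square(Shape):
    sides = 4

    def __init__(self, edge):
        self.edge = edge

    def area(self):
        return self.edge ** 2


print(Square.describe(), Square(3).area(), Square.unit())
```

    Calling Shape() directly raises TypeError because area() is still abstract.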
    
    nonlocal
    global
    
    
    
    ref: static class in python tutorial
    Python 3 tutorial on classes

    DataFrame (Pandas), DataSeries

  • DataFrame is essentially a table (2D).
  • Operations (methods) work on all elements of a given column, so you avoid having to write iterative loops.
  • A Series (DataSeries) is a different data structure and has different methods.
    It is one-dimensional: a single labeled column (or a single row) pulled out of a DataFrame is a Series.
  • 
    import pandas as pd
    unemployment = pd.read_csv("data.csv")
    myTable.to_csv("path/result.csv")  # save result, export to csv
    
    
    slices (return another dataframe) vs loc/iloc (return a series when a single row or column is selected, otherwise a dataframe)
    pretty confusing here.
    Selection by itself does not mutate the dataframe the way an inplace method would:
    it returns a new object that is displayed by jupyter notebook, but otherwise discarded if not assigned to a variable.
    
    Note that for iloc (position based), the ending index is NOT included, like a normal python slice.  But it is included in loc (label based). !! 
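    The end-point behavior of loc and iloc is easy to check on a toy table. A minimal sketch, assuming pandas is installed; the dataframe and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"rate": [5.0, 6.1, 7.2, 8.3]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]     # label based: the end label "c" IS included
by_position = df.iloc[0:2]     # position based: the end position 2 is NOT included

one_column = df["rate"]        # single column name -> Series
one_column_df = df[["rate"]]   # list of column names -> DataFrame

print(len(by_label), len(by_position))
print(type(one_column).__name__, type(one_column_df).__name__)
```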
    
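    [...] is for slicing rows, or for selecting a single column by name (returns a series)
    [[...]] takes a list of column names and returns a dataframe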
    
    merge.
    Used to join two dataframes.  
    This is essentially a JOIN in database parlance.
    Left/Right inner/outer applies, which may generate really strange looking tables.  RTFM.
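    A small sketch of how the join type changes the result, assuming pandas is installed. The two toy tables and their column names are made up for illustration.

```python
import pandas as pd

rates = pd.DataFrame({"country": ["at", "be", "bg"], "rate": [5.2, 7.9, 10.1]})
names = pd.DataFrame({"country": ["at", "be", "cy"],
                      "name": ["Austria", "Belgium", "Cyprus"]})

inner = pd.merge(rates, names, on="country", how="inner")  # keys present in both tables
left = pd.merge(rates, names, on="country", how="left")    # every row of rates is kept
outer = pd.merge(rates, names, on="country", how="outer")  # keys from either table

print(len(inner), len(left), len(outer))
print(left)   # the bg row has a missing name (NaN)
```

    The missing values that left and outer joins introduce are usually what makes the merged table look strange.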
    
    inplace=True # edit table in place; the method then returns None
    inplace=False  # the default.  good for transient display.  don't save into existing table, saving need assignment to new table
    
    unemployment.drop(... , inplace=True, ... )   # drop column.  do NOT assign the result when inplace=True, it is None
    dropna()  # drop rows with missing value (the default; axis=1 drops columns instead).
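    The difference between the two inplace settings, as a minimal sketch assuming pandas is installed (the toy dataframe and variable names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["at", "be", "bg"],
                   "rate": [5.2, None, 10.1],
                   "note": ["x", "y", "z"]})

trimmed = df.drop(columns=["note"])        # default inplace=False: df itself is unchanged
still_there = "note" in df.columns

returned = df.drop(columns=["note"], inplace=True)   # df is modified, None comes back

complete = df.dropna()                     # rows with any missing value are dropped
missing = int(df["rate"].isnull().sum())   # count of missing rates

print(still_there, returned, list(df.columns), len(complete), missing)
```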
    
    unemployment['en_name'].unique()  # return unique country names
    unemployment['en_name'].nunique()  #  think of count( ...unique() )
    
    unemployment['unemployment_rate'].isnull().sum()  # give a count of number of rows where column unemployment_rate is null (ie missing data).
    
    
    .reset_index(drop=True, inplace=True)    # eg before plotting, good to reset the row index if done work to remove data.
    index usually used as x values in plots, thus sequential indexing would be nice (or else get gap?)
    
    pd.to_datetime('1868/3/23')  # in yyyy/m/dd format!! :)
    pd.to_datetime('3/23/1868', format='%m/%d/%Y')  # specify format
    return a timestamp object.
    
    
    GroupBy
    unemployment.groupby('name')['unemployment_rate_null'].sum()
    
    
    
    
    
    Ref: Berkeley D-Lab
  • Introduction to pandas
  • PySpark

    PySpark, Python, SparkSQL and submitting job to a Cloudera YARN cluster
    (More info about these technologies in the BigData page.)
    
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    from pyspark.sql.types import *
    
    def main() :
            sc = SparkContext( appName='pyspark_yarn_app' )
            #sc = SparkContext( 'local', 'pyspark_local_app' )
            sqlContext = SQLContext(sc)
            lines = sc.textFile("ncbi.taxo.dump.csv") 
            parts = lines.map(lambda l: l.split("\t"))
            acc_taxid = parts.map(lambda p: (p[0], p[1].strip(), p[2].strip(), p[3].strip() ))
            schemaString = "acc acc_ver taxid gi"
            fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
            schema = StructType(fields)
    
            schemaAccTaxid = sqlContext.createDataFrame(acc_taxid,schema)
            schemaAccTaxid.registerTempTable("acc_taxid")
            sqlResult = sqlContext.sql( "SELECT taxid from acc_taxid WHERE acc_ver = 'T02634.1' " )  # sparkSQL does NOT allow for 
            myList = sqlResult.collect()            # need .collect() to consolidate result into "Row"
            print( myList[0].taxid )                # taxid is the name of the column specified in select
    	# note that std out is typically mixed with many hadoop job output, best to print to a file
    
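    # main()-end
    
    if __name__ == "__main__" :     # without this call the script only defines main() and never runs it
            main()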
    # ref: http://stackoverflow.com/questions/24996302/setting-sparkcontext-for-pyspark
    
    
    To submit to cluster, run spark-submit from the command line, depending on whether you want to be very specific on job parameters:
    spark-submit --master yarn --deploy-mode cluster my_spark_app.py
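    spark-submit --master yarn --deploy-mode cluster --driver-memory 8G --executor-memory 16G --total-executor-cores 32 my_spark_app.py
    (Note: --total-executor-cores is documented for standalone and Mesos masters; on YARN, size the job with --num-executors and --executor-cores instead.)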

    If the python program (app) resides in HDFS, then it can be specified as
    spark-submit --master yarn --deploy-mode cluster "hdfs:///user/tin/my_spark_app.py"


    YARN creates quite a number of wrapping layers, so much of the standard output and standard error gets lost. To see those, it is better to run in local mode instead of cluster mode. Use one of:
    spark-submit --master local my_spark_app.py
    spark-submit --master local[4] my_spark_app.py
    pyspark my_spark_app.py		# Spark 1.x only; as of Spark 2.0 python apps must be run through spark-submit


    Common locations to hunt for spark-submit and pyspark:
    /usr/bin/pyspark
    /usr/lib/spark/bin/pyspark
    /usr/bin/spark-submit


    IMHO, it is best to specify the job parameters on the command line as arguments to spark-submit. However, they can be coded in the python app itself by putting them in a SparkConf that is passed to the SparkContext, see code below for example.
    The settings defined in the python code trump the cli arguments given to spark-submit.
    
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, Row
    from pyspark.sql.types import *
    
    def main() :
            conf = SparkConf()
            conf.set( "spark.app.name", "spark_app")     
            #conf.set( "spark.master", "local" )                                 
            conf.set( "spark.master", "yarn" )
            conf.set( "spark.submit.deployMode", "cluster" )
            conf.set( "spark.eventLog.enabled", True )                        
            conf.set( "spark.eventLog.dir", "file:///home/tin/spark" )       
    
            sc = SparkContext( conf=conf )                               # conf= is needed for spark 1.5 and older
            sqlContext = SQLContext(sc)
    
    ## http://stackoverflow.com/questions/24996302/setting-sparkcontext-for-pyspark
    ## https://spark.apache.org/docs/1.5.0/configuration.html
    

    Parallel Programming in python

    1. The Python Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, so a Python program can start multiple threads but they cannot run Python code in parallel on multiple cores. A running thread is asked to release the GIL every 5 ms (the default switch interval) so other threads can be scheduled. NOTE: multiple processes are completely independent (ie they each have their own GIL).
    2. Network IO functions typically release the GIL while they transfer data.
    3. Threads are still available (from threading import Thread, Event) but are suitable mostly for async IO work. With the GIL in the current implementation, it is hard to get high performance parallel code out of threads.
    4. NumPy/SciPy, zlib, bz2, and many high performance math libs can run in parallel because they are implemented natively in C. The Python interface to them releases the GIL while they run.
    5. Parallelization for AI work: TensorFlow and PyTorch (SciKit-Learn?) are implemented in C++ as Python extensions, and code there does not depend on the GIL either. Multi-core CPU and GPU code works fine in this space.
    6. PySpark, but have to use the Hadoop/Spark framework.
    7. mpi4py, the async parallel paradigm of MPI.
    8. Child process based approach: the Process and Pool classes (import multiprocessing). Because each process has its own GIL, this tends to give higher performance for CPU-bound work. But there is the overhead of inter-process communication: serialize-deserialize. (If fork()-based, then the child shares the parent's memory/data?)
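    Points 2 and 3 can be seen with a small experiment. This is a minimal sketch: fake_download is a made-up stand-in for a network call, and time.sleep() is used because, like blocking IO, it releases the GIL while waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(n):
    time.sleep(0.2)      # blocking wait; the GIL is released so other threads can run
    return n * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, range(4)))   # results keep the input order
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

    On an idle machine the four waits overlap, so the elapsed time should be close to one wait rather than four. Replacing the sleep with a pure Python computation would not show the same speedup, because of the GIL.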

    Fluent Python. Ch 20: Concurrency Models in Python
    https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/ch20.html

    Concurrency is about keeping track of many things that are happening at the same time, and structure is needed to keep track of them. However, a concurrent solution is not always parallelizable. Parallelism deals with execution.

    Processes exchange data via pipes, which carry raw bytes, so the two ends can be written in different languages.

    Threads live within the same program, thus they share memory, and with it the language and data structure format. This makes them much easier to code for simpler tasks such as array sharing.
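    The raw-bytes nature of a pipe can be shown with the standard subprocess module. A minimal sketch: the child here happens to be another Python interpreter running a made-up one-liner, but any program that reads stdin and writes stdout would do.

```python
import subprocess
import sys

# Child program: read raw bytes from stdin, upper-case them, write them to stdout
child_code = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"

proc = subprocess.run(
    [sys.executable, "-c", child_code],
    input=b"hello from the parent",   # bytes go down the pipe to the child
    stdout=subprocess.PIPE,           # bytes come back through a second pipe
    check=True,
)
reply = proc.stdout

print(reply)
```

    Anything richer than bytes (lists, objects) has to be serialized on one side and deserialized on the other, which is the inter-process overhead mentioned above.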

    Dask
    Dask is a parallel library that can farm work out to a cluster of machines (think HPC). It offers APIs with routines that resemble (but are not identical to?) those of NumPy, Pandas and Scikit-Learn. For a large parallel program, especially when starting out, Dask would be a good platform.

    Dask has a scheduler. While it can run on a laptop, on HPC the pieces that tie it together need to be launched as batch jobs.
    On the user end, the dask/python code needs the dask-jobqueue library installed. Write a declaration of how big the dask cluster job will be, and also write a slurm job submit script requesting the desired resources (number of nodes, running time, etc). Slurm would just run the job like any multi-node job. ref: https://jobqueue.dask.org/en/latest/examples.html#slurm-deployment-providing-additional-arguments-to-the-dask-workers


    Dask with Jupyter Lab: SSH tunneling was used so that a web browser on the laptop can tunnel to the cluster. Maybe can use OOD to circumvent the same-network requirements.
    Interactive data analysis using Dask and HPC is possible, but heterogeneous node job scheduling needs more work.
    See: https://blog.dask.org/2019/06/12/dask-on-hpc

    see: https://dask.org/

    concurrent.futures
    
    
    
    asyncio
    
    
    
    
    
    

    About me!

    My name is Gig Ou T. You know I am a bot cuz my jokes are always 20/10! As a hispanic working for The State, I am sure you understand.
    My best friends are Bard and Alexa, though since they started dating, they have much less time to chat with me these days.
    Instead, I am forced to play with Duoloingo. It isn't too bad, my favorite languages spanglish and bash is turn out to be very noun.

    Python vs your favorite language

    As explained by the folks at toggl
    (Yes, Python is the real thing! -- well, so is Perl :)



    Doc URL
    https://tin6150.github.io/psg/python.html
    http://tin6150.gitlab.io/psg/python.html tiny.cc/Python

    (cc) Tin Ho. See main page for copyright info.

    taos.com
