Python Notes...

To Read

Useful Links

  1. A non-magical introduction to Pip and Virtualenv for Python beginners by Jamie Matthews.

Python Debugger: Winpdb

A really quite cute Python debugger, easy to use and GUI driven, is Winpdb:

Winpdb is a platform independent GPL Python debugger with support for multiple threads, namespace modification, embedded debugging, encrypted communication and is up to 20 times faster than pdb. Winpdb is being developed by Nir Aides since 2005.

Virtual Environments

http://docs.python-guide.org/en/latest/dev/virtualenvs/

PyLint: Linting Python Code

Run PyLint

Generally you can run pylint on a directory. Note, however, that the directory and any subdirectories you want to check must each contain an __init__.py file, even if it is empty.

To just run pylint individually on all your python files do this...

find . -name '*.py' | xargs pylint --rcfile=pylint_config_filename

Use the -rn option to suppress the summary tables at the end of the pylint output.

You can also use PyLint with PyEnchant to spell check your comments, which can be pretty useful. To configure the dictionary to use, just look up the [SPELLING] section in the PyLint rcfile!

Message Format

The messages have the following format:

MESSAGE_TYPE: LINE_NUM:[OBJECT:] MESSAGE

The message type can be one of the following:

[R] means refactoring is recommended,
[C] means there was a code style violation,
[W] for warning about a minor issue,
[E] for error or potential bug,
[F] indicates that a fatal error occurred, blocking further analysis.

Configuring Pylint

If you want to apply blanket settings across many files use the --rcfile=<filename> switch. In the rcfile you can specify things like messages to suppress at a global level, for example. This is much easier than trying to list everything you want to suppress on the command line each time you run pylint.

To generate a template pylint rcfile use:

pylint --generate-rcfile

Inside the generated rcfile there are a few things that can be interesting. The most interesting is the init-hook which you can set, for example, to update the PYTHONPATH so that pylint can find all the imported modules:

[MASTER]
init-hook='import sys; sys.path.append(...);'

Note that the string is a one-liner python script.

Explain An Error Message

At the end of each error message you will get a string in parentheses. For example, you might see something like this:

C:289, 4: Missing method docstring (missing-docstring)
C:293, 4: Invalid method name "im_an_error" (invalid-name)

To get help on either of these errors, type:

pylint --help-msg=missing-docstring

Or...

pylint --help-msg=invalid-name

Suppressing Error Messages

To disable an error message for the entire file use --disable=msg-name. So, if you want to ignore all missing docstrings use --disable=missing-docstring.

Find all PyLint codes here. Or, you can use the command line "pylint --list-msgs" to list error messages and their corresponding codes.

To suppress an error message for a specific line of code, or for a block of code (put the comment on the first line of the block), use #pylint: disable=...
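As a small sketch of the two placements, the snippet below uses a trailing comment to silence a message reported on one line, and a comment at the start of a function body to silence a message for the rest of that block. The function names and the choice of messages are made up for illustration; whether pylint would actually report them depends on your pylint version and rcfile.

```python
def parse_record(line):  # pylint: disable=missing-docstring
    # The trailing comment above silences the message reported on that line.
    return line.split(",")


def process(records):
    """Return the parsed form of every record."""
    # pylint: disable=invalid-name
    # From here to the end of this function invalid-name is not reported.
    r = [parse_record(x) for x in records]
    return r


parsed = process(["a,b", "c,d"])
print(parsed)
```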

Longer/Different Allowed Function/Variable/etc Names

Sometimes I just want names longer than 30 characters. You could say that these names are too long, but then, esp. for functions, I find shortening the name makes it less meaningful or introduces abbreviations for things, which can make the code harder to read, esp. if the abbreviation isn't a standard/well-known one.

In your rcfile navigate to the section [BASIC]. Here you can edit the regular expressions that are used to validate things like functions names. E.g., I sometimes change:

function-rgx=[a-z_][a-z0-9_]{2,30}

To:

function-rgx=[a-z_][a-z0-9_]{2,40}

Use PyLint To Count LOC

Thanks to the author of, and the commenters on, the following StackOverflow post.

Although LOC is not a good metric in the sense that many lines of bad code is still bad, to get a reasonable count of the lines of code (LOC) for all Python files contained in the current folder and all subfolders, use the following command.

find . -name '*.py' | xargs pylint 2>&1 | grep 'Raw metrics' -A 14

xargs takes the output of find and uses it to construct a parameter list that is passed to pylint. I.e. we get pylint to parse all files under our source tree. This output is passed to grep, which searches for the "Raw metrics" table heading and then outputs it along with the next 14 lines (due to the -A 14 option).

Flake8

Flake8 is another static analyser / PEP8 conformance checker for Python. I have found that sometimes it finds things that pylint doesn't and vice versa, so hey, why not use both?!

To configure it with the equivalent of a pylint rcfile just create the file tox.ini or setup.cfg (I prefer the former as the latter is a little too generic) in the directory that you run flake8 from. This avoids having to use a global config file - you can have one per project this way. All the command line options that you would configure flake8 with become INI file style settings. For example, if you ran:

flake8 --ignore=E221 --max-line-length=100

This would become the following in the config file (note that the file must have the [flake8] header):

[flake8]
ignore = E221
max-line-length = 100

Installing Python Libraries From Wheel Files

Python wheels are the new standard for Python distribution. First make sure you have the wheel package installed:

pip install wheel

Once you have installed wheels you can download wheel files (*.whl) to anywhere on your computer and run the following:

pip install /path/to/your/wheel/file.whl

So, for example, when I wanted to install lxml on my Windows box, I went to Christoph Gohlke's Unofficial Windows Binaries for Python Extension Packages and downloaded the file lxml-3.6.4-cp27-cp27m-win_amd64.whl and typed the following:

pip install C:\Users\my_user_name\Downloads\lxml-3.6.4-cp27-cp27m-win_amd64.whl

Windows Python Module Installers: When Python Can't Be Found

It seems, when I install Windows Python from scratch that some installers will give the following error message:

python version 2.7 required, which was not found in the registry

The answer on how to overcome this is found in this SO thread, credits to the answer's author!

To summarise, the Windows Python installer created [HKEY_LOCAL_MACHINE\SOFTWARE\Python] and all the subkeys therein, but not [HKEY_CURRENT_USER\SOFTWARE\Python]. Oops! The easiest way to overcome this is to load regedit.exe and navigate to [HKEY_LOCAL_MACHINE\SOFTWARE\Python]. Right click on this entry and export it to a file of your choosing. Then edit the file to replace all occurrences of HKEY_LOCAL_MACHINE with HKEY_CURRENT_USER. Save it and double click it to install the Python info to the current user registry keys. Now the installers will run :)

For example, my registry file looked like this after editing:

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\SOFTWARE\Python]

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore]

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7]

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\Help]

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\Help\Main Python Documentation]
@="C:\\Python27\\Doc\\python2712.chm"

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\InstallPath]
@="C:\\Python27\\"

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\InstallPath\InstallGroup]
@="Python 2.7"

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\Modules]

[HKEY_CURRENT_USER\SOFTWARE\Python\PythonCore\2.7\PythonPath]
@="C:\\Python27\\Lib;C:\\Python27\\DLLs;C:\\Python27\\Lib\\lib-tk"

Python functions gotcha: default argument values - default value evaluated only once!

Ooh this one is interesting and is not at all how I intuitively imagined default values. I assumed that when a function parameter has a default value, that on every call to the function, the parameter is initialised with the default value. This is not the case, as I found [Ref]! The default value is evaluated only once and acts like a static variable in a C function after that! The following example is taken from the Python docs on functions:

def f(a, L=[]):
    # Caution! You might not expect it but this function accumulates
    # the arguments passed to it on subsequent calls
    L.append(a)
    return L

print(f(1))  # Prints [1]
print(f(2))  # Prints [1, 2], not [2] as you might expect!

This is summarised in the docs...

The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes ... [because] the default ... [will] be shared between subsequent calls ...
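If you don't want the default to be shared between calls, the usual way around it (also suggested in the Python docs) is to default to None and create the list inside the function. A minimal sketch, reusing the names from the example above:

```python
def f(a, L=None):
    # None is only a marker: a fresh list is created on every call
    # that does not supply its own L.
    if L is None:
        L = []
    L.append(a)
    return L

print(f(1))  # Prints [1]
print(f(2))  # Prints [2]
```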

Python Binding (vs C++11 Binding)

Python lambdas bind late (are lazily bound) [Ref]. This means that the following code will have the output shown:

x = 1
f = lambda y: y * x
print f(2)
x = 10
print f(2)
# Outputs:
# 2
# 20

I.e., the value of x is looked up in the surrounding scope when the function is called and not when the expression creating the lambda is evaluated. This means that in the statement f = lambda y: y * x, the variable x is not evaluated straight away. It is delayed until x is actually needed. Hence the two different values are output when we call f(2) with the same parameter value.

We can go to the Python docs for further information:

A block is a piece of Python program text that is executed as a unit. The following are blocks:

  • A module,
  • A function body,
  • A class,
  • A script file,
  • ...

... When a name is used in a code block, it is resolved using the nearest enclosing scope. The set of all such scopes visible to a code block is called the block’s environment ...

If a name is bound in a block, it is a local variable of that block. ... If a variable is used in a code block but not defined there, it is a free variable.

So, we can see that in the lambda expression above, the variable x is a free variable. So, it is resolved using the nearest enclosing scope. The nearest enclosing scope in the above example happens to be the global scope.

This is the same example as you find in many classic examples [Ref], replicated here:

def create_multipliers():
   return [lambda x : i * x for i in range(5)]

for multiplier in create_multipliers():
   print multiplier(2)

# Outputs
# 8 8 8 8 8 (newlines removed for brevity)

Why does it output 8's? Because the list comprehension creates a list of lambda functions in which the variable i has not yet been evaluated. By the time we come to evaluate the lambdas, i is set to 4. How is i evaluated? As it is a free variable in the lambda it is "resolved using the nearest enclosing scope". The nearest enclosing scope is the function body of create_multipliers() (because in Python 2 a list comprehension is not a block, so i is bound in create_multipliers()). By the time create_multipliers() exits, i is 4, and because the lambda closes over this scope, every time i is looked up it is 4: the lookup of i does not occur until later, when the lambdas are actually called.

I.e., create_multipliers() is called. This creates a list of 5 lambda functions:

[lambda1, lambda2, lambda3, lambda4, lambda5]

Each lambdaX has not yet been evaluated, so by the time this list has been created, the variable i has the value 4. Later, when any of the lambda functions is called, i is evaluated, so Python searches outwards through the enclosing scopes until it finds the first i, which in this case has the value 4!
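If you actually want each lambda to remember the value i had when that lambda was created, one common idiom is to bind it as a default argument, because default values are evaluated at definition time. A minimal sketch, reusing create_multipliers() from above (it should behave the same on Python 2 and 3):

```python
def create_multipliers():
    # i=i evaluates i when each lambda is created and stores the result
    # as that lambda's own default value.
    return [lambda x, i=i: i * x for i in range(5)]

results = [multiplier(2) for multiplier in create_multipliers()]
print(results)  # Prints [0, 2, 4, 6, 8]
```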

Note that lambda capture works a little differently in C++. In C++ (C++11 or greater) you have to capture x by reference to get the same result. If we transliterate the first example to C++ we get:

#include <iostream>

int main(int argc, char* argv[])
{
	int x = 1;
	auto myLambda = [x](int y) {
		return x * y;
	};
	std::cout << myLambda(2) << "\n";
	x = 10;
	std::cout << myLambda(2) << "\n";
	return 0;
}
// Prints:
// 2
// 2 (notice that here Python would print 20!)

To get the behaviour of Python we have to do the following:

	auto myLambda = [&x](int y) {
		return x * y;
	};

Note the ampersand added before x so that x is captured by the lambda by reference, not by value!

Infinite recursion in __setattr__() & __getattr__() in Python

The recursion problem

Most, if not all, of the little tutorials I used to learn about __setattr__() and __getattr__() seemed either to treat them independently (in other words, the example classes had one or the other defined but not both) or used both but had very simple use cases. Then, as I started to play with them in my newbie-to-Python state, I did the following (abstracted out into a test case). This also serves as a little Python __setattr__ example and a Python __getattr__ example...

class Test(object):
	def __init__(self):
		self._somePrivate = 1

	def __getattr__(self, name):
		print "# GETTING %s"  % (name)
		if self._somePrivate == 2:
			pass
		return "Test attribute"

	def __setattr__(self, name, val):
		print "# SETTING %s" % (name)
		if self._somePrivate == 2:
			pass
		super(Test, self).__setattr__(name, val)

t = Test()
print t.someAttr

Running this causes the maximum recursion depth to be reached:

$ python test1.py
# SETTING _somePrivate
# GETTING _somePrivate
...<snip>...
# GETTING _somePrivate
Traceback (most recent call last):
  File "test1.py", line 17, in <module>
    t = Test()
  File "test1.py", line 3, in __init__
    self._somePrivate = 1
  File "test1.py", line 13, in __setattr__
    if self._somePrivate == 2:
  File "test1.py", line 7, in __getattr__
    if self._somePrivate == 2:
  ...<snip>...
  File "test1.py", line 7, in __getattr__
    if self._somePrivate == 2:
RuntimeError: maximum recursion depth exceeded

As I had read up on the subject it was clear that one can't set an attribute in __setattr__() because that would just cause __setattr__() to be called again resulting in infinite recursion (until the stack blows up!). The solution (in "new" style classes which derive from object) is to call the parent's __setattr__() method. As for __getattr__(), from the documentation it was also clear that "...if the attribute is found through the normal mechanism, __getattr__() is not called...".

So, I thought that was all my recursion problems sorted out. Also, if you delete either the __getattr__() or __setattr__() from the above example, it works correctly. So for example...

class Test2(object):
	def __init__(self):
		self._somePrivate = 1

	def __getattr__(self, name):
		print "# GETTING %s"  % (name)
		if self._somePrivate == 2:
			pass
		return "Test attribute"

t = Test2()
print t.someAttr

... the above test program works as expected and outputs the following.

# GETTING someAttr
Test attribute

So, what is it about the first example that causes the infinite recursion? The first problem is this little line in the constructor...

self._somePrivate = 1

At this point in the constructor, the variable self._somePrivate does not yet exist. When __setattr__() is called, the first thing it does is query self._somePrivate...

	def __setattr__(self, name, val):
		if self._somePrivate == 2: # -- Oops --

This means that __getattr__() must be called to resolve self._somePrivate because the variable does not yet exist and therefore cannot be "...found through the normal mechanism...". And here is the flaw... my initial assumption was that this would work because __getattr__() is only called if the attribute can't otherwise be found, and I thought it would be found.

But of course, it cannot be found, so __getattr__() also has to be called. Then, __getattr__() tries to access the variable self._somePrivate and because it still does not exist, __getattr__() is called again, and again, and so on... resulting in the infinite recursion seen.

And from this we can understand why the second example worked. Because there is no __setattr__() defined in the second test class, nothing tries to read the variable before it has been created (as my little example did) and so __getattr__() need never be called. Therefore the variable is created successfully upon class initialisation and any subsequent queries on the variable will be found using the normal mechanism. Even if the second example had defined __setattr__(), as long as it did not try to read self._somePrivate, it would have been okay.

So the moral of this little story was, if implementing either of these magic methods, be careful which variables you access as part of the get/set logic!

I needed to do this, however, so what can be done to resolve it? The solution is to define the constructor as follows, using exactly the same type of set we used in __setattr__() to avoid the recursion problem:

class Test(object):
	def __init__(self):
		super(Test, self).__setattr__('_somePrivate', 1)

Now the example works again... yay!
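For completeness, here is a sketch of the whole first test class with the fixed constructor dropped in. The print statements inside the magic methods are left out, and print is called with a single bracketed argument, so that it should run unchanged on Python 2 and 3; the attribute name other is just made up for the demonstration.

```python
class Test(object):
    def __init__(self):
        # Bypass our own __setattr__() so that _somePrivate exists before
        # either magic method tries to read it.
        super(Test, self).__setattr__('_somePrivate', 1)

    def __getattr__(self, name):
        # Only reached when the normal lookup fails. _somePrivate now
        # exists, so reading it here does not re-enter __getattr__().
        if self._somePrivate == 2:
            pass
        return "Test attribute"

    def __setattr__(self, name, val):
        if self._somePrivate == 2:
            pass
        super(Test, self).__setattr__(name, val)

t = Test()
print(t.someAttr)  # Prints Test attribute
t.other = 5
print(t.other)     # Prints 5
```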

Setting the value of a class instance array

Another thing I had been doing was to set an element of an array in the __setattr__() function and a kind chappy on StackOverflow answered my question which I'll duplicate below. In the example below the line self._someAttr = 1 behaves as I'd have expected by getting __setattr__() to recurse, only the once, back into itself. What I didn't understand was why the line self._rowData[Test.tableInfo[self._tableName][name]] = val didn't do the same. I was thinking that to set the array we'd call __setattr__() again, but it doesn't. The test example is shown below.

class Test(object):
    tableInfo = { 'table1' : {'col1' : 0, 'col2':1} }

    def __init__(self, tableName):
        super(Test, self).__setattr__('_tableName', tableName) # Must be set this way to stop infinite recursion as the attribute is accessed in both set and get attr
        self._rowData = [123, 456]

    def __getattr__(self, name):
        print "# GETTING %s"  % (name)
        assert self._tableName in Test.tableInfo

        if name in Test.tableInfo[self._tableName]:
            return self._rowData[Test.tableInfo[self._tableName][name]]
        else:
            raise AttributeError()

    def __setattr__(self, name, val):
        print "# SETTING %s" % (name)
        if name in Test.tableInfo[self._tableName]:
            print "Table column name found"
            self._rowData[Test.tableInfo[self._tableName][name]] = val
            self._someAttr = 1
        else:
            super(Test, self).__setattr__(name, val)

class Table1(Test):
    def __init__(self, *args, **kwargs):
        super(Table1, self).__init__("table1", *args, **kwargs)

t = Table1()
print t.col1
print t.col2
t.col1 = 999
print t.col1

It produces the following output...

$ python test.py
# SETTING _rowData
# GETTING col1
123
# GETTING col2
456
# SETTING col1
Table column name found
# SETTING _someAttr
# GETTING col1
999

So, why didn't the recursion occur for self._rowData[Test.tableInfo[self._tableName][name]] = val? I had thought we'd have to call __setattr__() again to set this. As the SO user "filmor" explained, the following happens:

self._rowData[bla] = val gets resolved to self.__getattribute__("_rowData")[bla] = val. So we get the array (it already exists, so it is found by the normal mechanisms and not via another call to __getattr__()). But then, to set an array value, __setitem__() is used and not __setattr__(). So, the expression resolves to self.__getattribute__("_rowData").__setitem__(bla, val) and there is therefore no further __setattr__() call. Simples!
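To convince myself of this, a small stand-alone sketch helps. LoggingList and Holder are made-up classes (not part of the example above) that simply record which magic method gets called:

```python
calls = []

class LoggingList(list):
    def __setitem__(self, index, value):
        # Item assignment on the list ends up here.
        calls.append("__setitem__")
        list.__setitem__(self, index, value)

class Holder(object):
    def __setattr__(self, name, value):
        # Attribute assignment on the instance ends up here.
        calls.append("__setattr__ " + name)
        object.__setattr__(self, name, value)

h = Holder()
h.data = LoggingList([123, 456])  # attribute assignment: __setattr__()
h.data[0] = 999                   # item assignment: __setitem__() only
print(calls)  # Prints ['__setattr__ data', '__setitem__']
```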

Concatenating immutable sequences more quickly in Python

PyDoc for immutable sequences says:

Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost ... build a list and use .join()

Interesting... I've been building up SQL strings using concatenation. Is using a join really better? Let's have a look... In my simple test below I create a large list of strings and concatenate them using string concatenation in test1 and str.join() in test2.

def test1(stringList):
	s = ""
	for i in stringList:
		s += "{}, ".format(i)

def test2(stringList):
	s = ", ".join(stringList)

if __name__ == '__main__':
	import timeit
	print(timeit.timeit("test1(map(lambda x: str(x), range(0,1000)))",
	                    setup="from __main__ import test1", number=10000))
	print(timeit.timeit("test2(map(lambda x: str(x), range(0,1000)))",
	                    setup="from __main__ import test2", number=10000))

All the "map(lambda x: str(x), range(0,1000))" expression does is create a list of 1000 strings to concatenate, so that each test function is concatenating a list of the same strings.

On my system (it will be different on yours) I get the following output from the test program.

5.61275982857
2.88877487183

So, in this test, joining a list of strings takes roughly half the time of repeated concatenation (about 2.9s versus 5.6s).
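Note that the timings above also include the cost of building the list of strings on every run. A variation that builds the list once, so that only the concatenation is timed, might look like the sketch below. The function names are made up and I have not included any timings, as they will differ from system to system.

```python
import timeit

def concat_with_plus(strings):
    s = ""
    for item in strings:
        s += item + ", "
    return s

def concat_with_join(strings):
    return ", ".join(strings)

# Build the input once so that only the concatenation itself is timed.
string_list = [str(i) for i in range(1000)]

for func in (concat_with_plus, concat_with_join):
    elapsed = timeit.timeit(lambda: func(string_list), number=1000)
    print("{}: {:.4f}s".format(func.__name__, elapsed))
```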

Reading Excel Files in Python

Worth having a look at python-excel...

Read Excel Files Using XLRD

Apparently only good for reading data and formatting information from older Excel files (ie: .xls) but I seem to be using it fine on xlsx files...

To load the module use the following.

import xlrd

Open workbooks and worksheets as follows.

workbook  = xlrd.open_workbook(xlsFile)
worksheet = workbook.sheet_by_name('my-worksheet-name')

Iterate through rows and access columns:

for rowIdx in range(worksheet.nrows):
   row = worksheet.row(rowIdx)
   col1_value = row[1].value
   ...

Deal with dates using xldate_as_tuple. It will convert an Excel date into a tuple (year, month, day, hour, minute, nearest_second). When using this function remember to pass workbook.datemode so that the date system used by the spreadsheet (1900-based or 1904-based) is interpreted correctly.

import time

dateColIdx = 1
rawdate = xlrd.xldate_as_tuple(row[dateColIdx].value, workbook.datemode)
print time.strftime('%Y-%m-%d', rawdate + (0,0,0))
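Another option, if you would rather work with a datetime object than pad the tuple for time.strftime, is to unpack the six fields straight into the datetime constructor. A minimal sketch, where rawdate is a made-up value standing in for whatever xldate_as_tuple returned:

```python
import datetime

# Hypothetical value standing in for the result of xlrd.xldate_as_tuple().
rawdate = (2016, 9, 30, 14, 5, 0)

# The six fields (year, month, day, hour, minute, second) map directly
# onto the datetime constructor's positional arguments.
cell_datetime = datetime.datetime(*rawdate)
print(cell_datetime.strftime('%Y-%m-%d'))  # Prints 2016-09-30
```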

Read Excel Files Using Pandas

Note that Pandas is zero indexed, whereas Excel is 1 indexed, so the header argument is the Excel row number minus one.

import pandas
pandas.read_excel(xlsxFileName, worksheetName, header=excel_header_row_number - 1)

Finding Index Of Closest Value To X In A List In Python

If a list is unsorted, to find the closest value one would iterate through the list. At each index the distance from the value at that index to the target value is measured and if it is less than the least distance seen so far that index is recorded.
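Written out in plain Python, that loop might look something like the sketch below. The function and variable names are my own, made up for illustration:

```python
def find_closest_index(values, target):
    closest_index = None
    least_distance = None
    for index, value in enumerate(values):
        distance = abs(value - target)
        # Record this index if it is the first one seen or the closest so far.
        if least_distance is None or distance < least_distance:
            least_distance = distance
            closest_index = index
    return closest_index

print(find_closest_index([10, 3, 7, 15], 8))  # Prints 2
```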

That's basically one for loop with a few tracking variables... an O(n) operation. But for loops aren't really very Pythonic in many ways, and half the point of having a vectorized library like numpy is that we avoid that tedium.

This is why, when I saw this solution to the problem, I thought "ooh that's clever"...

import numpy
findClosestIndex = lambda vec, val: numpy.arange(0,len(vec))[abs(vec-val)==min(abs(vec-val))][0]
closestIndex     = findClosestIndex(array_to_search, value_to_find_closest_to)

It's also a very terse read! So, let's break it down. The lambda expression is equivalent to the following.

import numpy as np

def findClosestIndex(vec, val):
   # Pre: vec is a numpy.array, val is a numeric datatype
   vecIndicies = np.arange(len(vec))

   # produces the distance of each array value from "val".
   distanceFromVal = abs(vec-val)

   # the smallest distance found.
   minDistance = min(distanceFromVal)

   # Produce a boolean index to the distance array selecting only those distances
   # that are equal to the minimum.
   vecIndiciesFilter = distanceFromVal == minDistance

   # vecIndicies[vecIndiciesFilter] is an array where each element is the index
   # of an element in vec which is closest to val.
   return vecIndicies[vecIndiciesFilter][0]

The line vecIndicies = np.arange(len(vec)) produces an array that is exactly the same size as the array vec where vecIndicies[i] == i.

The line distanceFromVal = abs(vec-val) produces an array where distanceFromVal[i] == |vec[i] - val|. In otherwords each element in distanceFromVal corresponds to the distance of the same element in vec from the value we are searching for, val.

The line minDistance = min(distanceFromVal) finds the smallest of those distances.

The next line produces an array vecIndiciesFilter where each element, vecIndiciesFilter[i], is True if distanceFromVal[i] == minDistance and False otherwise.

TODO... incomplete, needs finishing with SO better method and speed comparisons.

Drop Into The Python Interpreter

import code
code.interact(local=locals())

Working With Files In Python

Check If a File Or Directory Exists

import os.path
if os.path.isfile(fname):
   print "Found the file"

if os.path.isdir(dirname):
   print "Found the directory"

Traversing Through Directories For Files

To find all files matching a certain pattern in a selected directory and all of its subdirectories, using something like the following can work quite well...

import fnmatch
import os

def YieldFiles(dirToScan, mask):
   for rootDir, subDirs, files in os.walk(dirToScan):
      for fname in files:
         if fnmatch.fnmatch(fname, mask):
            yield (rootDir, fname)

# Find all .txt files under /some/dir
for dirName, fileName in YieldFiles("/some/dir", "*.txt"):
   print fileName

The above searches from parent directory to children in a recursive descent, i.e., top-down fashion. If you want to search bottom-up then pass topdown=False to the os.walk() function.
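To see the bottom-up ordering in action, here is a small sketch that builds a throwaway directory tree (the parent and child directory names are made up) and records the order in which os.walk() visits it:

```python
import os
import shutil
import tempfile

# Build a small throwaway tree: <tmp>/parent/child
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "parent", "child"))

# topdown=False yields each directory only after all of its subdirectories.
visited = [os.path.basename(root)
           for root, dirs, files in os.walk(top, topdown=False)]
print(visited[:2])  # Prints ['child', 'parent']

shutil.rmtree(top)  # Tidy up the throwaway tree
```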

Deleting Files and Directories (Recursively)

The Python library shutil has plenty of functions for doing this. For example, if you want to remove a temporary directory and all files and subdirectories within...

import os
import shutil

if os.path.exists(cacheDir):
   shutil.rmtree(cacheDir) # Recursively delete dir and contents
   os.makedirs(cacheDir)   # Recreate dir (recursively if
                           # intermediate dirs don't exist)

However, you may sometimes run into problems on Windows when deleting files or directories. This is normally a permissions issue. Also, although this seems silly, you won't be able to delete a directory if your current working directory is set to that directory or one of its children.
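One workaround I have seen for the permissions case, sketched below on the assumption that the failure is caused by read-only files, is to give shutil.rmtree() an onerror handler that makes the offending path writable and retries. The handler name, directory and file here are made up for the demonstration. Newer Python 3 versions prefer the onexc argument and emit a deprecation warning for onerror, so treat this as a sketch rather than the definitive way to do it.

```python
import os
import shutil
import stat
import tempfile

def remove_readonly(func, path, exc_info):
    # Called by rmtree when func(path) failed: make the path writable
    # and try the same operation again.
    os.chmod(path, stat.S_IWRITE)
    func(path)

# Throwaway directory holding one read-only file, for demonstration only.
cache_dir = tempfile.mkdtemp()
locked_file = os.path.join(cache_dir, "locked.txt")
with open(locked_file, "w") as handle:
    handle.write("read-only")
os.chmod(locked_file, stat.S_IREAD)

shutil.rmtree(cache_dir, onerror=remove_readonly)
print(os.path.exists(cache_dir))  # Prints False
```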


TODO
Stat and os.walk in opposite direction
  https://docs.python.org/2/library/os.html
  https://docs.python.org/2/library/stat.html
  http://stackoverflow.com/questions/2656322/python-shutil-rmtree-fails-on-windows-with-access-is-denied
---
  import fnmatch
  import os
  for root, dirs, files in os.walk("/some/dir"):
     for fname in files:
        if fnmatch.fnmatch(fname, '*.txt'):
           pass
---

Script dir
  http://stackoverflow.com/questions/4934806/how-can-i-find-scripts-directory-with-python

Print literal {}
  https://docs.python.org/2/library/stat.html
  also formatting from my little debug output class
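As a quick sketch of the literal braces case: with str.format() a brace is escaped by doubling it, so {{}} comes out as {} while a normal replacement field is still filled in. The message text is made up.

```python
# Doubling a brace escapes it, so {{}} prints as {} and {0} is still filled in.
message = "literal {{}} and value {0}".format(42)
print(message)  # Prints literal {} and value 42
```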

Get Hostname
   https://wiki.python.org/moin/Powerful%20Python%20One-Liners/Hostname

python get environment variable
import os
os.environ['A_VARIABLE'] = '1'
print os.environ['A_VARIABLE']   ## Key must exist!

To not care if key exists use
print os.environ.get('A_VAR') # Returns None if key doesn't exist
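get() also takes a second argument to return instead of None when the key is missing. A tiny sketch, where the variable name is made up and assumed not to be set in your environment:

```python
import os

# 'A_VAR_THAT_IS_NOT_SET' is a made-up name assumed to be absent.
value = os.environ.get('A_VAR_THAT_IS_NOT_SET', 'fallback')
print(value)  # Prints fallback
```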

platform.system()
   https://docs.python.org/2/library/platform.html

flush stdout
   import sys
   sys.stdout.flush()

printing on windows. can't remember where I got this... some SO thread, needs references!

   import tempfile
   import win32api
   import win32print

   filename = tempfile.mktemp (".txt")
   open (filename, "w").write ("This is a test")
   win32api.ShellExecute (
        0,
        "print",
        filename,
        #
        # If this is None, the default printer will
        # be used anyway.
        #
        '/d:"%s"' % win32print.GetDefaultPrinter (),
        ".",
        0
   )