PyPop.utils#

Module for common utility classes and functions.

Contains convenience classes for output of text and XML files.

Attributes#

GENOTYPE_SEPARATOR

Separator between genotypes

GENOTYPE_TERMINATOR

Terminator of genotypes

Classes#

TextOutputStream

Output stream for writing text files.

XMLOutputStream

Output stream for writing XML files.

StringMatrix

Matrix of strings and other metadata from input file to PyPop.

Group

Group list or sequence into non-overlapping chunks.

Functions#

critical_exit(message, *args)

Log a CRITICAL message and exit with status 1.

getStreamType(stream)

Get the type of stream.

glob_with_pathlib(pattern)

Use globbing with pathlib.

natural_sort_key(s[, _nsre])

Generate a key for natural (human-friendly) sorting.

unique_elements(li)

Gets the unique elements in a list.

appendTo2dList(aList[, appendStr])

Append a string to each element in a list.

convertLineEndings(file, mode)

Convert line endings based on platform.

fixForPlatform(filename[, txt_ext])

Fix for some Windows/MS-DOS platforms.

copyfileCustomPlatform(src, dest[, txt_ext])

Copy file to file with fixes.

copyCustomPlatform(file, dist_dir[, txt_ext])

Copy file to directory with fixes.

checkXSLFile(xslFilename[, path, subdir, abort, msg])

Check XSL filename and return full path.

getUserFilenameInput(prompt, filename)

Get user filename input.

splitIntoNGroups(alist[, n])

Divides a list up into parcels of size n (plus whatever is left over).

Module Contents#

GENOTYPE_SEPARATOR = '~'#

Separator between genotypes

Example

In a haplotype 01:01~13:01~04:02
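As a rough illustration of how the separator delimits the per-locus alleles in such a haplotype string, here is a minimal sketch using plain `str.split`. The constant is redefined locally so the snippet runs without PyPop installed; the variable names are illustrative only.

```python
# Local stand-in for PyPop.utils.GENOTYPE_SEPARATOR, redefined here so the
# snippet is self-contained.
GENOTYPE_SEPARATOR = "~"

haplotype = "01:01~13:01~04:02"

# Split the haplotype string into its per-locus allele names.
alleles = haplotype.split(GENOTYPE_SEPARATOR)
print(alleles)

# Joining with the same separator round-trips to the original string.
rejoined = GENOTYPE_SEPARATOR.join(alleles)
print(rejoined == haplotype)
```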

GENOTYPE_TERMINATOR = '~'#

Terminator of genotypes

Example

02:01:01:01~

class TextOutputStream(file)#

Output stream for writing text files.

Parameters:

file (file) – file handle

write(str)#

Write to stream.

Parameters:

str (str) – string to write

writeln(str='\n')#

Write a newline to stream.

Parameters:

str (str, optional) – defaults to newline

close()#

Close stream.

flush()#

Flush to disk.

class XMLOutputStream(file)#

Bases: TextOutputStream

Output stream for writing XML files.

opentag(tagname, **kw)#

Write an open XML tag to stream.

Tag attributes passed as optional named keyword arguments.

Example

opentag('tagname', role='something', id='else')

produces the result:

<tagname role="something" id="else">

Attributes and values are optional:

opentag('tagname')

Produces:

<tagname>

See also

Must be followed by a closetag().

Parameters:

tagname (str) – name of XML tag

emptytag(tagname, **kw)#

Write an empty XML tag to stream.

This follows the same syntax as opentag() but without XML content (but can contain attributes).

Example

emptytag('tagname', attr='val')

produces:

<tagname attr="val"/>

Parameters:

tagname (str) – name of XML tag

closetag(tagname)#

Write a closing XML tag to stream.

Example

closetag('tagname')

Generate a tag in the form:

</tagname>

See also

Must be preceded by an opentag().

Parameters:

tagname (str) – name of XML tag

tagContents(tagname, content, **kw)#

Write XML tags around contents to a stream.

Example

tagContents('tagname', 'foo bar')

produces:

<tagname>foo bar</tagname>

Parameters:
  • tagname (str) – name of XML tag

  • content (str) – must only be a string. &, < and > are converted into valid XML equivalents.
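To make the escaping rule concrete, the following is a minimal sketch of what wrapping content in a tag could look like. `tag_contents_sketch` is a hypothetical stand-in written for this example, not the PyPop implementation; it returns a string rather than writing to a stream, and it relies only on the standard library's `xml.sax.saxutils`.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_contents_sketch(tagname, content, **kw):
    """Hypothetical stand-in: build '<tag attrs>escaped content</tag>'."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    attrs = "".join(f" {name}={quoteattr(str(value))}" for name, value in kw.items())
    # escape() converts &, < and > into their XML entity equivalents.
    return f"<{tagname}{attrs}>{escape(content)}</{tagname}>"


plain = tag_contents_sketch("tagname", "foo bar")
escaped = tag_contents_sketch("note", "a < b & c > d", role="demo")
print(plain)
print(escaped)
```

Escaping the content before writing it keeps the output well-formed even when the text contains characters that are reserved in XML.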

class StringMatrix(rowCount=None, colList=None, extraList=None, colSep='\t', headerLines=None)#

Bases: collections.abc.Sequence

Matrix of strings and other metadata from input file to PyPop.

StringMatrix is a subclass of collections.abc.Sequence and represents genotype or locus-based data in a row-oriented matrix structure with NumPy-style indexing and sequence semantics. Rows correspond to individuals, and columns correspond to loci.

The object supports indexing, assignment, copying, and printing using standard Python and NumPy idioms.

Parameters:
  • rowCount (int) – number of rows in matrix

  • colList (list) – list of locus keys in a specified order

  • extraList (list) – other non-matrix metadata

  • colSep (str) – column separator

  • headerLines (list) – list of lines in the header of original file

Note

  • len(matrix) returns the number of rows.

  • Indexing retrieves data by locus or locus combinations.

  • Assignment updates genotype or metadata values in place.

  • Slicing over rows (e.g., matrix[i:j]) is not currently supported.

  • Deep copying produces a fully independent matrix.

Examples

Create a matrix of two individuals with two loci and assign genotype data:

>>> matrix = StringMatrix(2, ["A", "B"])
>>> matrix[0, "A"] = ("A0_1", "A0_2")
>>> matrix[1, "A"] = ("A1_1", "A1_2")
>>> matrix[0, "B"] = ("B0_1", "B0_2")
>>> matrix[1, "B"] = ("B1_1", "B1_2")

Length of matrix is defined as the number of individuals in the matrix:

>>> len(matrix)
2

Retrieve data for a single locus:

>>> matrix["A"]
[['A0_1', 'A0_2'], ['A1_1', 'A1_2']]

String representation:

>>> print(matrix)
StringMatrix([['A0_1', 'A0_2', 'B0_1', 'B0_2'],
       ['A1_1', 'A1_2', 'B1_1', 'B1_2']], dtype=object)

Copying the matrix:

>>> import copy
>>> m2 = copy.deepcopy(matrix)
>>> m2 is matrix
False

__repr__()#

Override default representation.

Returns:

new string representation

Return type:

str

__len__()#

Get number of rows (individuals) in the matrix.

This allows StringMatrix instances to be used with len(), iteration, and other Python sequence protocols.

Returns:

number of rows in the matrix

Return type:

int

__deepcopy__(memo)#

Create a deepcopy for copy.deepcopy.

This simply calls self.copy() to allow copy.deepcopy(matrixInstance) to work out of the box.

Parameters:

memo (dict) – opaque object

Returns:

copy of the matrix

Return type:

StringMatrix

__getslice__(i, j)#

Get slice (overrides built-in).

Warning

Currently not supported for StringMatrix

__getitem__(key)#

Get the item at given key (overrides built-in numpy).

Parameters:

key (str) – locus key

Returns:

a list of tuples for key: a single column vector if only one position is specified, or a list of lists (a set of column vectors) if several positions are specified

Return type:

list

Raises:

KeyError – if key is not found, or is of the wrong type

__setitem__(index, value)#

Set the value at an index (overrides built-in).

Parameters:
  • index (tuple) – index into matrix

  • value (tuple|str) – can set using a tuple of strings, or a single string (for metadata)

Raises:
dump(locus=None, stream=sys.stdout)#

Write file to a stream in original format.

Parameters:
copy()#

Make a (deep) copy.

Returns:

a deep copy of the current object

Return type:

StringMatrix

getNewStringMatrix(key)#

Create new StringMatrix containing specified loci.

Note

The format of the keys is identical to __getitem__() except that it returns a full StringMatrix instance which includes all metadata

Parameters:

key (str) – a string representing the loci, using the locus1:locus2 format

Returns:

full instance

Return type:

StringMatrix

Raises:

KeyError – if locus cannot be found.

getUniqueAlleles(key)#

Get naturally sorted list of unique alleles.

Parameters:

key (str) – loci to get

Returns:

list of unique integers sorted by allele name using natural sort

Return type:

list

convertToInts()#

Convert the matrix to integers.

Note

This function is used by the PyPop.haplo.Haplostats class. Integers start at 1 for compatibility with the haplo-stats module.

Returns:

matrix where the original allele names are now represented by integers

Return type:

StringMatrix

countPairs()#

Count all possible pairs of haplotypes for each matrix row.

Warning

This does not do any involved handling of missing data as per geno.count.pairs from R haplo.stats module.

Returns:

each element is the number of pairs in row order

Return type:

list
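For intuition about what such a count looks like, here is a hedged sketch based on the standard combinatorial argument for unphased diploid genotypes with no missing data: a row with h heterozygous loci is consistent with 2^(h-1) haplotype pairs when h is at least 1, and with exactly one pair otherwise. `count_pairs_sketch` and the row layout (a list of two-allele tuples per individual) are assumptions made for this example and are not taken from the PyPop source.

```python
def count_pairs_sketch(rows):
    """Count haplotype pairs consistent with each row (no missing data)."""
    counts = []
    for row in rows:
        # A locus is heterozygous when its two alleles differ.
        het = sum(1 for first, second in row if first != second)
        # Each extra heterozygous locus doubles the number of phasings.
        counts.append(2 ** (het - 1) if het > 0 else 1)
    return counts


rows = [
    [("A1", "A1"), ("B1", "B1"), ("C1", "C1")],  # no heterozygous loci
    [("A1", "A2"), ("B1", "B1"), ("C1", "C1")],  # one heterozygous locus
    [("A1", "A2"), ("B1", "B2"), ("C1", "C1")],  # two heterozygous loci
    [("A1", "A2"), ("B1", "B2"), ("C1", "C2")],  # three heterozygous loci
]
pair_counts = count_pairs_sketch(rows)
print(pair_counts)
```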

flattenCols()#

Flatten columns into a single list.

Important

Currently assumes entries are integers.

Returns:

all alleles, the two genotype columns concatenated for each locus

Return type:

list

filterOut(key, blankDesignator)#

Get matrix rows filtered by a designator.

Parameters:
  • key (str) – locus to filter

  • blankDesignator (str) – string to exclude

Returns:

the rows of the matrix that do not contain blankDesignator

Return type:

list

getSuperType(key)#

Get a matrix grouped by specified key.

Example

Return a new matrix with the column vector with the alleles for each genotype concatenated like so:

>>> matrix = StringMatrix(2, ["A", "B"])
>>> matrix[0, "A"] = ("A01", "A02")
>>> matrix[1, "A"] = ("A11", "A12")
>>> matrix[0, "B"] = ("B01", "B02")
>>> matrix[1, "B"] = ("B11", "B12")
>>> print(matrix)
StringMatrix([['A01', 'A02', 'B01', 'B02'],
       ['A11', 'A12', 'B11', 'B12']], dtype=object)
>>> matrix.getSuperType("A:B")
StringMatrix([['A01:B01', 'A02:B02'],
       ['A11:B11', 'A12:B12']], dtype=object)

Parameters:

key (str) – loci to group

Returns:

a new matrix with the columns concatenated

Return type:

StringMatrix
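The concatenation shown in the example can be sketched with plain Python lists. `super_type_sketch` is a hypothetical helper written for illustration; it assumes each row is a list of two-allele tuples, one per locus, and it does not reproduce the metadata handling of a real StringMatrix.

```python
def super_type_sketch(rows, sep=":"):
    """Join the first alleles and the second alleles across loci, per row."""
    combined = []
    for row in rows:
        first = sep.join(genotype[0] for genotype in row)
        second = sep.join(genotype[1] for genotype in row)
        combined.append([first, second])
    return combined


rows = [
    [("A01", "A02"), ("B01", "B02")],  # individual 0: loci A and B
    [("A11", "A12"), ("B11", "B12")],  # individual 1: loci A and B
]
super_rows = super_type_sketch(rows)
print(super_rows)
```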

class Group(li, size)#

Group list or sequence into non-overlapping chunks.

Example

>>> for pair in Group('aabbccddee', 2):
...    print(pair)
...
aa
bb
cc
dd
ee
>>> a = Group('aabbccddee', 2)
>>> a[0]
'aa'
>>> a[3]
'dd'

Parameters:
  • li (str|list) – string or list

  • size (int) – size of grouping

__getitem__(group)#

Get the item by position.

Parameters:

group (int) – get the item by position

Returns:

the value at that position

Return type:

str|list

Raises:

IndexError – if group is out of bounds
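A minimal sketch of how such chunked access could be implemented is shown below. `GroupSketch` is a hypothetical stand-in, not the PyPop class; it relies on Python's legacy iteration protocol, in which a `for` loop calls `__getitem__` with 0, 1, 2, and so on until `IndexError` is raised.

```python
class GroupSketch:
    """Hypothetical stand-in: view a sequence as non-overlapping chunks."""

    def __init__(self, li, size):
        if size < 1:
            raise ValueError("size must be a positive integer")
        self.li = li
        self.size = size

    def __getitem__(self, group):
        start = group * self.size
        # Raising IndexError past the end is what terminates a for loop.
        if group < 0 or start >= len(self.li):
            raise IndexError("group index out of range")
        return self.li[start:start + self.size]


pairs = list(GroupSketch("aabbccddee", 2))
chunks = list(GroupSketch([1, 2, 3, 4, 5], 2))
print(pairs)
print(chunks)
```

Slicing preserves the input type, so a string yields strings and a list yields lists, which matches the documented `str|list` return type.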

critical_exit(message, *args)#

Log a CRITICAL message and exit with status 1.

Added in version 1.4.0.

Parameters:

message (str) – Logging format string.
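The described behavior (log at CRITICAL level, then exit with status 1) can be approximated with the standard `logging` and `sys` modules. This is only a sketch under that assumption: `critical_exit_sketch` and the logger name are invented for the example, and the real function may configure logging differently.

```python
import logging
import sys

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("critical_exit_sketch")


def critical_exit_sketch(message, *args):
    """Log a CRITICAL message using %-style arguments, then exit with 1."""
    logger.critical(message, *args)
    sys.exit(1)


# sys.exit() raises SystemExit, so the exit status can be observed in a test.
try:
    critical_exit_sketch("Cannot open %s", "missing.pop")
except SystemExit as exc:
    exit_status = exc.code

print(exit_status)
```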

getStreamType(stream)#

Get the type of stream.

Parameters:

stream (TextOutputStream|XMLOutputStream) – stream to check

Returns:

either xml or text.

Return type:

string
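Because XMLOutputStream derives from TextOutputStream, a type check of this kind has to test for the more specific class first. The sketch below illustrates that ordering with stub classes; `TextStreamStub`, `XMLStreamStub` and `get_stream_type_sketch` are hypothetical names, and the actual implementation may determine the type differently.

```python
class TextStreamStub:
    """Stub standing in for TextOutputStream."""


class XMLStreamStub(TextStreamStub):
    """Stub standing in for XMLOutputStream (a subclass of the text stream)."""


def get_stream_type_sketch(stream):
    # Check the subclass first: an XML stream is also a text stream, so
    # testing the base class first would misreport it as "text".
    if isinstance(stream, XMLStreamStub):
        return "xml"
    return "text"


xml_kind = get_stream_type_sketch(XMLStreamStub())
text_kind = get_stream_type_sketch(TextStreamStub())
print(xml_kind, text_kind)
```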

glob_with_pathlib(pattern)#

Use globbing with pathlib.

Parameters:

pattern (str) – globbing pattern

Returns:

list of pathlib paths matching the pattern

Return type:

list
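One way to glob a full pattern string with `pathlib` is to split it into a parent directory and a final wildcard component. The sketch below takes that approach; `glob_sketch` is a hypothetical helper, it only handles wildcards in the last path component, and it is not necessarily how the PyPop function works.

```python
import tempfile
from pathlib import Path


def glob_sketch(pattern):
    """Return sorted paths matching a pattern such as '/some/dir/*.txt'."""
    path = Path(pattern)
    # Path.glob() is relative to a directory, so glob the final component
    # of the pattern inside its parent directory.
    return sorted(path.parent.glob(path.name))


# Demonstrate against a throwaway directory so the example is repeatable.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.dat"):
        (Path(tmp) / name).touch()
    matches = [p.name for p in glob_sketch(str(Path(tmp) / "*.txt"))]

print(matches)
```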

natural_sort_key(s, _nsre=re.compile('([0-9]+)'))#

Generate a key for natural (human-friendly) sorting.

This function splits a string into text and number components so that numbers are compared by value instead of lexicographically. It is intended for use as the key function in list.sort() or sorted().

Example

>>> items = ["item2", "item10", "item1"]
>>> sorted(items, key=natural_sort_key)
['item1', 'item2', 'item10']

Parameters:
  • s (str) – The string to split into text and number components.

  • _nsre (Pattern) – Precompiled regular expression used internally to split the string into digit and non-digit chunks. This is not intended to be overridden in normal use.

Returns:

A list of strings and integers to be used as a sort key.

Return type:

list
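The splitting step can be sketched as follows, using the same regular expression as the default `_nsre` in the signature. `natural_sort_key_sketch` is an illustrative reimplementation, not the library code; in particular, any case folding the real function may apply is left out.

```python
import re


def natural_sort_key_sketch(s, _nsre=re.compile("([0-9]+)")):
    """Split s into alternating text and integer chunks for sorting."""
    # With a capturing group, re.split() keeps the digit runs, and they
    # always land at the odd indices of the resulting list.
    return [
        int(chunk) if index % 2 else chunk
        for index, chunk in enumerate(_nsre.split(s))
    ]


key = natural_sort_key_sketch("item10")
ordered = sorted(["item2", "item10", "item1"], key=natural_sort_key_sketch)
print(key)
print(ordered)
```

Since text chunks always sit at even positions and integers at odd positions, two keys are compared element by element without mixing strings and integers.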

unique_elements(li)#

Gets the unique elements in a list.

Parameters:

li (list) – a list

Returns:

unique elements

Return type:

list
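The documentation does not say whether the order of the input is kept. As one possible reading, the sketch below removes duplicates while preserving first-seen order; `unique_elements_sketch` is a hypothetical name and the ordering behavior is an assumption.

```python
def unique_elements_sketch(li):
    """Return the distinct elements of li, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    # Elements must be hashable for this approach to work.
    return list(dict.fromkeys(li))


uniques = unique_elements_sketch(["A", "B", "A", "C", "B"])
print(uniques)
```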

appendTo2dList(aList, appendStr=':')#

Append a string to each element in a list.

Parameters:
  • aList (list) – list to append to

  • appendStr (str) – string to append

Returns:

a list with string appended to each element

Return type:

list

convertLineEndings(file, mode)#

Convert line endings based on platform.

Parameters:
  • file (str) – file name to convert

  • mode (int) –

    Conversion mode, one of

    • 1 Unix to Mac

    • 2 Unix to DOS
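The two modes correspond to simple byte substitutions. The sketch below applies them to an in-memory bytes object instead of a file; `convert_line_endings_sketch` is a hypothetical helper, and it assumes the input uses Unix line endings only.

```python
def convert_line_endings_sketch(data, mode):
    """Convert Unix line endings in a bytes object according to mode."""
    if mode == 1:
        # Unix (LF) to classic Mac (CR).
        return data.replace(b"\n", b"\r")
    if mode == 2:
        # Unix (LF) to DOS (CRLF).
        return data.replace(b"\n", b"\r\n")
    raise ValueError(f"unsupported mode: {mode!r}")


unix_text = b"first\nsecond\n"
mac_text = convert_line_endings_sketch(unix_text, 1)
dos_text = convert_line_endings_sketch(unix_text, 2)
print(mac_text)
print(dos_text)
```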

fixForPlatform(filename, txt_ext=0)#

Fix for some Windows/MS-DOS platforms.

Parameters:
  • filename (str) – path to file

  • txt_ext (int, optional) – if enabled (1) add a .txt extension

copyfileCustomPlatform(src, dest, txt_ext=0)#

Copy file to file with fixes.

Parameters:
  • src (str) – source file

  • dest (str) – destination file

  • txt_ext (int, optional) – if enabled (1) add a .txt extension

copyCustomPlatform(file, dist_dir, txt_ext=0)#

Copy file to directory with fixes.

Parameters:
  • file (str) – source file

  • dist_dir (str) – destination directory

  • txt_ext (int, optional) – if enabled (1) add a .txt extension

checkXSLFile(xslFilename, path='', subdir='', abort=False, msg='')#

Check XSL filename and return full path.

Parameters:
  • xslFilename (str) – name of the XSL file

  • path (str) – root path to check

  • subdir (str) – subdirectory under path to check

  • abort (bool) – if enabled (True) and the file isn’t found, exit with an error. Default is False

  • msg (str) – output message on abort

Returns:

checked and validated path

Return type:

str

getUserFilenameInput(prompt, filename)#

Get user filename input.

Read user input for a filename, check its existence, continue requesting input until a valid filename is entered.

Parameters:
  • prompt (str) – description of file

  • filename (str) – default filename

Returns:

name of file eventually selected

Return type:

str

splitIntoNGroups(alist, n=1)#

Divides a list up into parcels of size n (plus whatever is left over).

Example

>>> a = ['A', 'B', 'C', 'D', 'E']
>>> splitIntoNGroups(a, 2)
[['A', 'B'], ['C', 'D'], ['E']]

Parameters:
  • alist (list) – list to divide up

  • n (int) – parcel size

Returns:

list of lists

Return type:

list
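The chunking in the example can be reproduced with list slicing. The sketch below is an illustrative reimplementation under the assumption that n is the parcel size, as the parameter description states; `split_into_groups_sketch` is a hypothetical name.

```python
def split_into_groups_sketch(alist, n=1):
    """Split alist into consecutive parcels of size n; the last may be shorter."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Step through the list n items at a time and slice out each parcel.
    return [alist[start:start + n] for start in range(0, len(alist), n)]


parcels = split_into_groups_sketch(["A", "B", "C", "D", "E"], 2)
print(parcels)
```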