PyPop.ParseFile#

Module for parsing data files.

Includes classes for parsing individuals genotyped at multiple loci and classes for parsing literature data which only includes allele counts.

Classes#

ParseFile

Abstract class for parsing a datafile.

ParseGenotypeFile

Class to parse standard datafile in genotype form.

ParseAlleleCountFile

Class to parse datafile in allele count form.

Module Contents#

class ParseFile(filename, validPopFields=None, validSampleFields=None, separator='\t', fieldPairDesignator='_1:_2', alleleDesignator='*', popNameDesignator='+', debug=0)#

Abstract class for parsing a datafile.

Not to be instantiated.

Constructor for ParseFile object.

  • ‘filename’: filename for the file to be parsed.

  • ‘validPopFields’: a string consisting of valid headers (one per line) for overall population data (no default).

  • ‘validSampleFields’: a string consisting of valid headers (one per line) for lines of sample data (no default).

  • ‘separator’: separator for adjacent fields (default: a tab stop, ‘\t’).

  • ‘fieldPairDesignator’: a string consisting of the suffixes, separated by a colon, that are added to the allele ‘stem’ for fields grouped in pairs (allele fields). For example, for ‘HLA-A’ and ‘HLA-A(2)’ use ‘:(2)’; for ‘DQA1_1’ and ‘DQA1_2’ use ‘_1:_2’. The latter case distinguishes both fields from the stem (default: ‘_1:_2’).

  • ‘alleleDesignator’: the first character of the key, which determines whether this column contains allele data. Defaults to ‘*’.

  • ‘popNameDesignator’: the first character of the key, which determines whether this column contains the population name. Defaults to ‘+’.

  • ‘debug’: switches debugging on if set to ‘1’ (default: no debugging, ‘0’).
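
The pairing convention described for ‘fieldPairDesignator’ can be illustrated with a short sketch. The helper `paired_field_names` is hypothetical and not part of PyPop; it assumes only that the designator is split on ‘:’ and that each piece is appended to the stem.

```python
def paired_field_names(stem, field_pair_designator):
    """Return the two column names implied by a stem and a designator.

    Hypothetical helper for illustration only; not part of PyPop.
    """
    first_suffix, second_suffix = field_pair_designator.split(":")
    return stem + first_suffix, stem + second_suffix


# ':(2)' leaves the first column equal to the stem itself.
hla_pair = paired_field_names("HLA-A", ":(2)")

# '_1:_2' gives both columns a suffix, so neither equals the stem.
dqa1_pair = paired_field_names("DQA1", "_1:_2")

print(hla_pair)
print(dqa1_pair)
```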

getPopData()#

Returns a dictionary of population data.

The dictionary is keyed by the types specified in the population metadata file.

getSampleMap()#

Returns dictionary of sample data.

Each dictionary entry holds either a 2-tuple of column positions or a single column position, keyed by the field originally specified in the sample metadata file.
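
As a rough sketch of how such a map might be consumed, the snippet below uses a hand-written stand-in for the returned dictionary. The field names, column positions, row contents and the helper `values_for` are all invented for illustration and are not taken from PyPop.

```python
# Stand-in for the dictionary returned by getSampleMap(); the field names
# and column positions here are made up for illustration.
sample_map = {"id": 0, "DQA1": (1, 2), "DRB1": (3, 4)}

# One already-split line of sample data (also invented).
row = ["UT900-23", "0101", "0201", "0301", "0401"]


def values_for(field, sample_map, row):
    """Return the cell(s) of `row` that belong to `field` (hypothetical helper)."""
    position = sample_map[field]
    if isinstance(position, tuple):
        # Paired (allele) field: a 2-tuple of column positions.
        return tuple(row[i] for i in position)
    # Unpaired field: a single column position.
    return row[position]


sample_id = values_for("id", sample_map, row)
dqa1 = values_for("DQA1", sample_map, row)
```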

getFileData()#

Returns file data.

Returns a 2-tuple ‘wrapper’ consisting of:

  • raw sample lines, without header metadata.

  • the field separator.
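
A minimal sketch of unpacking this 2-tuple follows. The tuple is a hand-written stand-in; it assumes the raw sample lines are strings that still contain the separator, which the description above does not state explicitly.

```python
# Stand-in for the 2-tuple returned by getFileData(); the sample lines are
# invented and assumed to be strings with the separator still in place.
file_data = (["s1\t0101\t0201\n", "s2\t0102\t****\n"], "\t")

sample_lines, separator = file_data

# Split each raw line into its fields, dropping any trailing newline first.
rows = [line.rstrip("\n").split(separator) for line in sample_lines]
```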

genSampleOutput(fieldList)#

Prints the data specified in ordered field list.

Use is currently deprecated.

serializeMetadataTo(stream)#

class ParseGenotypeFile(filename, untypedAllele='****', **kw)#

Bases: ParseFile

Class to parse standard datafile in genotype form.

Constructor for ParseGenotypeFile.

  • ‘filename’: filename for the file to be parsed.

In addition to the arguments for the base class, this class accepts the following additional keywords:

  • ‘untypedAllele’: the designator for an untyped locus. Defaults to ‘****’.
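
To make the role of the designator concrete, the sketch below filters a hand-written list of genotype pairs. How PyPop itself treats partly typed genotypes is not stated here, so dropping them is an assumption of this example.

```python
UNTYPED = "****"  # the documented default for `untypedAllele`

# Invented genotype pairs for a single locus.
genotypes = [("0101", "0201"), ("****", "****"), ("0301", "****")]

# Keep only the genotypes in which both alleles were typed (an assumption
# of this sketch, not documented PyPop behaviour).
fully_typed = [pair for pair in genotypes if UNTYPED not in pair]
```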

genValidKey(field, fieldList)#

Check and validate key.

  • ‘field’: string with field name.

  • ‘fieldList’: a dictionary of valid fields.

Check to see whether ‘field’ is a valid key, and generate the appropriate ‘key’. Returns a 2-tuple consisting of ‘isValidKey’ boolean and the ‘key’.

Note: this is done explicitly in the subclass of the abstract ‘ParseFile’ class, since the subclass should have ‘knowledge’ about the nature of the fields but the abstract class should not.
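
The return contract can be sketched as below. This is not the library's implementation: the function name, the lookup rule (strip a pair suffix, then prepend the allele designator) and the key returned for an invalid field are all assumptions made for illustration.

```python
def gen_valid_key_sketch(field, field_list, allele_designator="*",
                         field_pair_designator="_1:_2"):
    """Return (is_valid_key, key) for `field`; illustrative only."""
    # Plain (non-allele) field that is listed as-is.
    if field in field_list:
        return True, field
    # Allele field: strip a pair suffix and look up the designated stem.
    for suffix in field_pair_designator.split(":"):
        if not field.endswith(suffix):
            continue
        stem = field[: len(field) - len(suffix)]
        key = allele_designator + stem
        if key in field_list:
            return True, key
    # Assumption: an unknown field is returned unchanged with a False flag.
    return False, field


# Invented dictionary of valid fields.
valid_fields = {"id": None, "*DQA1": None}

paired_result = gen_valid_key_sketch("DQA1_2", valid_fields)
plain_result = gen_valid_key_sketch("id", valid_fields)
unknown_result = gen_valid_key_sketch("notes", valid_fields)
```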

getMatrix()#

Returns the genotype data.

Returns the genotype data in a ‘StringMatrix’ NumPy array.

serializeSubclassMetadataTo(stream)#

Serialize subclass-specific metadata.

class ParseAlleleCountFile(filename, **kw)#

Bases: ParseFile

Class to parse datafile in allele count form.

Currently only handles one locus per population, in format:

    <metadata-line1>
    <metadata-line2>
    DQA1 count
    0102 20
    0103 33
    …

Currently a prototype implementation.
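
As an illustration of the layout above, the following sketch reads the header and count lines into a dictionary. The function `parse_allele_counts` is hypothetical, it splits on any whitespace rather than on the configured separator, and it expects the metadata lines to have been removed already.

```python
def parse_allele_counts(lines, separator=None):
    """Parse a header line plus allele/count lines (hypothetical helper).

    `separator=None` splits on any run of whitespace; a real file may
    require the separator configured for the parser instead.
    """
    header, *records = [ln.split(separator) for ln in lines if ln.strip()]
    locus = header[0]
    counts = {}
    for allele, count in records:
        # Sum the counts in case an allele appears on more than one line.
        counts[allele] = counts.get(allele, 0) + int(count)
    return locus, counts


locus, counts = parse_allele_counts(["DQA1 count", "0102 20", "0103 33"])
```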

Constructor for ParseFile object.

  • ‘filename’: filename for the file to be parsed.

  • ‘validPopFields’: a string consisting of valid headers (one per line) for overall population data (no default).

  • ‘validSampleFields’: a string consisting of valid headers (one per line) for lines of sample data (no default).

  • ‘separator’: separator for adjacent fields (default: a tab stop, ‘\t’).

  • ‘fieldPairDesignator’: a string consisting of the suffixes, separated by a colon, that are added to the allele ‘stem’ for fields grouped in pairs (allele fields). For example, for ‘HLA-A’ and ‘HLA-A(2)’ use ‘:(2)’; for ‘DQA1_1’ and ‘DQA1_2’ use ‘_1:_2’. The latter case distinguishes both fields from the stem (default: ‘_1:_2’).

  • ‘alleleDesignator’: the first character of the key, which determines whether this column contains allele data. Defaults to ‘*’.

  • ‘popNameDesignator’: the first character of the key, which determines whether this column contains the population name. Defaults to ‘+’.

  • ‘debug’: switches debugging on if set to ‘1’ (default: no debugging, ‘0’).

genValidKey(field, fieldList)#

Checks to see validity of a field.

Given a ‘field’, this is checked against the ‘fieldList’, and a tuple of a boolean (whether the key is valid) and a key is returned.

The first element in the ‘fieldList’, which is a locus name, can match one of many loci (delimited by colons ‘:’). For example, it may look like:

‘DQA1:DRA:DQB1’

If the field in the input file matches any of these keys, the field and a valid match are returned.
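
The colon-delimited matching rule can be sketched as follows. The function `match_locus` is hypothetical, and returning the field unchanged with a False flag when nothing matches is an assumption of this example.

```python
def match_locus(field, locus_spec):
    """Return (is_valid, field) for a colon-delimited locus specification.

    Hypothetical helper for illustration only; not part of PyPop.
    """
    candidates = locus_spec.split(":")
    return field in candidates, field


matched = match_locus("DRA", "DQA1:DRA:DQB1")
unmatched = match_locus("DPB1", "DQA1:DRA:DQB1")
```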

serializeSubclassMetadataTo(stream)#

getAlleleTable()#

getLocusName()#

getMatrix()#

Returns the genotype data.

Returns the genotype data in a ‘StringMatrix’ NumPy array.