PyPop.ParseFile
Module for parsing data files.
Includes classes for parsing individuals genotyped at multiple loci and classes for parsing literature data which only includes allele counts.
Classes
- ParseFile: Abstract class for parsing a datafile.
- ParseGenotypeFile: Class to parse standard datafile in genotype form.
- ParseAlleleCountFile: Class to parse datafile in allele count form.
Module Contents
- class ParseFile(filename, validPopFields=None, validSampleFields=None, separator='\t', fieldPairDesignator='_1:_2', alleleDesignator='*', popNameDesignator='+', debug=0)
Abstract class for parsing a datafile.
Not to be instantiated.
Constructor for ParseFile object.
- ‘filename’: filename for the file to be parsed.
- ‘validPopFields’: a string consisting of valid headers (one per line) for overall population data (no default).
- ‘validSampleFields’: a string consisting of valid headers (one per line) for lines of sample data (no default).
- ‘separator’: separator for adjacent fields (default: a tab, ‘\t’).
- ‘fieldPairDesignator’: a string consisting of the additions to the allele ‘stem’, separated by a colon, for fields grouped in pairs (allele fields). For example, for ‘HLA-A’ and ‘HLA-A(2)’ use ‘:(2)’; for ‘DQA1_1’ and ‘DQA1_2’ use ‘_1:_2’. The latter case distinguishes both fields from the stem (default, per the signature above: ‘_1:_2’).
- ‘alleleDesignator’: the first character of the key, which determines whether this column contains allele data (default: ‘*’).
- ‘popNameDesignator’: the first character of the key, which determines whether this column contains the population name (default: ‘+’).
- ‘debug’: switches debugging on if set to ‘1’ (default: no debugging, ‘0’).
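The ‘fieldPairDesignator’ examples above can be read as two suffixes joined by a colon. The following is a minimal sketch of that reading, not PyPop code: the helper name `expand_field_pair` is hypothetical and exists only for illustration.

```python
def expand_field_pair(stem, field_pair_designator="_1:_2"):
    """Return the two column names derived from an allele stem.

    Hypothetical helper (not part of PyPop). It assumes the designator
    holds exactly two suffixes separated by a single colon.
    """
    first_suffix, second_suffix = field_pair_designator.split(":")
    return stem + first_suffix, stem + second_suffix


# Both fields differ from the stem.
print(expand_field_pair("DQA1", "_1:_2"))   # ('DQA1_1', 'DQA1_2')
# The first field is the bare stem because the first suffix is empty.
print(expand_field_pair("HLA-A", ":(2)"))   # ('HLA-A', 'HLA-A(2)')
```

An empty first suffix, as in ‘:(2)’, is what lets the first column of the pair reuse the stem unchanged.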
- getPopData()
Returns a dictionary of population data.
The dictionary is keyed by the types specified in the population metadata file.
- getSampleMap()
Returns a dictionary of sample data.
Each dictionary entry contains either a 2-tuple of column positions or a single column position, keyed by the field originally specified in the sample metadata file.
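To make the shape of such a mapping concrete, here is a rough sketch that builds a comparable dictionary from a header line. It is not the PyPop implementation: the function `build_sample_map`, the column names, and the choice of keys are all assumptions made for illustration.

```python
def build_sample_map(header_line, allele_stems, separator="\t",
                     field_pair_designator="_1:_2"):
    """Map field names to column positions (hypothetical helper).

    Paired allele fields map to a 2-tuple of positions, keyed by stem;
    every other column maps to its single position.
    """
    first, second = field_pair_designator.split(":")
    columns = header_line.rstrip("\n").split(separator)
    position = {name: idx for idx, name in enumerate(columns)}

    sample_map = {}
    paired_columns = set()
    for stem in allele_stems:
        col_a, col_b = stem + first, stem + second
        if col_a in position and col_b in position:
            sample_map[stem] = (position[col_a], position[col_b])
            paired_columns.update((col_a, col_b))

    # Remaining columns keep a single position.
    for name, idx in position.items():
        if name not in paired_columns:
            sample_map[name] = idx
    return sample_map


header = "populat\tid\tDQA1_1\tDQA1_2"   # made-up header line
sample_map = build_sample_map(header, ["DQA1"])
print(sample_map["DQA1"])   # (2, 3)
print(sample_map["id"])     # 1
```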
- getFileData()
Returns file data.
Returns a 2-tuple ‘wrapper’ containing:
- the raw sample lines, without header metadata;
- the field separator.
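A caller would typically unpack this pair and split each raw line on the separator. The sketch below shows that pattern on made-up data; `split_sample_lines` is a hypothetical helper, and the tuple is built by hand, not obtained from PyPop.

```python
def split_sample_lines(file_data):
    """Split raw sample lines into fields (hypothetical helper).

    `file_data` is assumed to be a (lines, separator) 2-tuple shaped
    like the value described for getFileData().
    """
    sample_lines, separator = file_data
    return [line.rstrip("\n").split(separator) for line in sample_lines]


# Made-up sample lines, for illustration only.
file_data = (["PopA\tS001\t0101\t0201\n", "PopA\tS002\t0102\t0102\n"], "\t")
rows = split_sample_lines(file_data)
print(rows[0])   # ['PopA', 'S001', '0101', '0201']
```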
- genSampleOutput(fieldList)
Prints the data specified in the ordered field list.
This method is currently deprecated.
- serializeMetadataTo(stream)
- class ParseGenotypeFile(filename, untypedAllele='****', **kw)
Bases: ParseFile
Class to parse standard datafile in genotype form.
Constructor for ParseGenotypeFile.
‘filename’: filename for the file to be parsed.
In addition to the arguments for the base class, this class accepts the following additional keywords:
‘untypedAllele’: the designator for an untyped locus (default, per the signature above: ‘****’).
- genValidKey(field, fieldList)
Check and validate key.
‘field’: string with field name.
‘fieldList’: a dictionary of valid fields.
Check to see whether ‘field’ is a valid key, and generate the appropriate ‘key’. Returns a 2-tuple consisting of ‘isValidKey’ boolean and the ‘key’.
Note: this is done explicitly in the subclass rather than in the abstract ‘ParseFile’ class, because the subclass should have ‘knowledge’ about the nature of the fields, whereas the abstract class should not.
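One plausible reading of this check is sketched below. It is an assumption, not the PyPop source: the standalone function `gen_valid_key`, the lookup order, and the key returned for an invalid field are all hypothetical. The only details taken from the text are the 2-tuple return value and the ‘*’ and ‘+’ designators.

```python
def gen_valid_key(field, field_list, allele_designator="*",
                  pop_name_designator="+"):
    """Return (is_valid_key, key) for a field name (hypothetical sketch).

    Assumes valid fields may be stored with a leading designator
    character, so the bare name is tried first, then each prefixed form.
    """
    candidates = (field,
                  allele_designator + field,
                  pop_name_designator + field)
    for candidate in candidates:
        if candidate in field_list:
            return True, candidate
    return False, field


# Made-up dictionary of valid fields, for illustration only.
valid_fields = {"*DQA1_1": None, "*DQA1_2": None, "+populat": None, "id": None}
print(gen_valid_key("DQA1_1", valid_fields))   # (True, '*DQA1_1')
print(gen_valid_key("unknown", valid_fields))  # (False, 'unknown')
```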
- getMatrix()
Returns the genotype data.
Returns the genotype data in a ‘StringMatrix’ NumPy array.
- serializeSubclassMetadataTo(stream)
Serialize subclass-specific metadata.
- class ParseAlleleCountFile(filename, **kw)
Bases: ParseFile
Class to parse datafile in allele count form.
Currently only handles one locus per population, in the format:
<metadata-line1>
<metadata-line2>
DQA1 count
0102 20
0103 33
…
Currently a prototype implementation.
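As an illustration of the layout described above, the following sketch reads such a file into a locus name and a dictionary of counts. It is independent of PyPop: the function `parse_allele_counts`, the fixed count of two metadata lines, and whitespace splitting are assumptions made for this example.

```python
def parse_allele_counts(lines, metadata_line_count=2, separator=None):
    """Parse allele-count lines (hypothetical helper, not PyPop code).

    Assumes `metadata_line_count` metadata lines, then a header line
    whose first field is the locus name, then 'allele count' rows.
    separator=None splits on any whitespace.
    """
    metadata = [line.rstrip("\n") for line in lines[:metadata_line_count]]
    header = lines[metadata_line_count].split(separator)
    locus_name = header[0]

    counts = {}
    for line in lines[metadata_line_count + 1:]:
        fields = line.split(separator)
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        allele, count = fields[0], int(fields[1])
        counts[allele] = counts.get(allele, 0) + count
    return metadata, locus_name, counts


example = [
    "<metadata-line1>\n",
    "<metadata-line2>\n",
    "DQA1\tcount\n",
    "0102\t20\n",
    "0103\t33\n",
]
metadata, locus, counts = parse_allele_counts(example)
print(locus)    # DQA1
print(counts)   # {'0102': 20, '0103': 33}
```

Allele names are kept as strings so that leading zeros such as ‘0102’ are preserved.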
Constructor for ParseFile object.
- ‘filename’: filename for the file to be parsed.
- ‘validPopFields’: a string consisting of valid headers (one per line) for overall population data (no default).
- ‘validSampleFields’: a string consisting of valid headers (one per line) for lines of sample data (no default).
- ‘separator’: separator for adjacent fields (default: a tab, ‘\t’).
- ‘fieldPairDesignator’: a string consisting of the additions to the allele ‘stem’, separated by a colon, for fields grouped in pairs (allele fields). For example, for ‘HLA-A’ and ‘HLA-A(2)’ use ‘:(2)’; for ‘DQA1_1’ and ‘DQA1_2’ use ‘_1:_2’. The latter case distinguishes both fields from the stem (default, per the ParseFile signature: ‘_1:_2’).
- ‘alleleDesignator’: the first character of the key, which determines whether this column contains allele data (default: ‘*’).
- ‘popNameDesignator’: the first character of the key, which determines whether this column contains the population name (default: ‘+’).
- ‘debug’: switches debugging on if set to ‘1’ (default: no debugging, ‘0’).
- genValidKey(field, fieldList)
Checks to see validity of a field.
Given a ‘field’, this is checked against the ‘fieldList’, and a tuple of a boolean (whether the key is valid) and a key is returned.
The first element in the ‘fieldList’, which is a locus name, can match one of many loci (delimited by colons ‘:’). For example, it may look like:
‘DQA1:DRA:DQB1’
If the field in the input file matches any of these keys, return the field and a valid match.
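The colon-delimited matching described above can be sketched as follows. This is a simplified illustration, not the PyPop method: `match_locus_field` is a hypothetical standalone function, and treating the remaining ‘fieldList’ entries as exact matches is an assumption.

```python
def match_locus_field(field, field_list):
    """Return (is_valid, key) for a field (hypothetical sketch).

    The first entry of `field_list` is assumed to hold one or more
    acceptable locus names separated by colons.
    """
    locus_candidates = field_list[0].split(":")
    if field in locus_candidates:
        return True, field
    # Assumption: any other entry must match the field exactly.
    if field in field_list[1:]:
        return True, field
    return False, field


field_list = ["DQA1:DRA:DQB1", "count"]
print(match_locus_field("DRA", field_list))     # (True, 'DRA')
print(match_locus_field("count", field_list))   # (True, 'count')
print(match_locus_field("HLA-A", field_list))   # (False, 'HLA-A')
```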
- serializeSubclassMetadataTo(stream)
- getAlleleTable()
- getLocusName()
- getMatrix()
Returns the genotype data.
Returns the genotype data in a ‘StringMatrix’ NumPy array.