2. Getting started with PyPop

2.1. Introduction

You may use PyPop to analyze many different kinds of data, including allele-level genotype data (as in Listing 2.1), allele-level frequency data (as in Listing 2.6), microsatellite data, SNP data, and nucleotide and amino acid sequence data.

As mentioned in the installation chapter, a minimal working example of a configuration file (.ini), and a population file (.pop), can be found by clicking the respective links.

There are two ways to run PyPop:

  • interactive mode (where the program will prompt you to directly type the input it needs); and

  • batch mode (where you supply all the command line options the program needs).

For the simplest application of PyPop, where you wish to analyze a single population, interactive mode is the easiest to use. We describe this mode first, then batch mode.

Note

The following assumes you have already installed PyPop, done any post-install adjustments needed for your platform, and verified that you can run the main commands (see the Examples section).

Interactive mode

To run PyPop in interactive mode, with a minimal “GUI”, on Windows or MacOS, you can directly click on the pypop-interactive file in the directory where the scripts were installed (see post-install adjustments).

You can also type pypop-interactive after starting a console application on any platform (on MacOS and GNU/Linux, this is normally the Terminal program; on Windows, it’s Command prompt).

In most cases, this will launch a console with the following:

PyPop: Python for Population Genomics (1.0.0)
[Python 3.10.9 | Linux.x86_64-x86_64 | x86_64]
Copyright (C) 2003-2006 Regents of the University of California
Copyright (C) 2007-2023 PyPop team.
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You may redistribute copies of PyPop under the terms of the GNU
General Public License.  For more information about these
matters, see the file named COPYING.

Select both an '.ini' configuration file and a '.pop' file via the
system file dialog.

Following this:

  1. the system file dialog will appear prompting you to select an .ini configuration file.

  2. a second system file dialog will prompt you for a .pop data file.

  3. after both files are selected the console will display the processing of the file:

    PyPop is processing sample.pop ...
    PyPop run complete!
    XML output(s) can be found in: ['sample-out.xml']
    Plain text output(s) can be found in: ['sample-out.txt']
    Press Enter to continue...
    
  4. when the run is completed, the last line will prompt you to press Enter to leave the console window (highlighted above).

If the system file dialog does not appear (e.g., if you are running on a terminal without a display), PyPop will fall back to text-mode entry for the files, where you need to type the path (either relative or absolute) to each file. The output should resemble:

PyPop: Python for Population Genomics (1.0.0)
[Python 3.10.9 | Linux.x86_64-x86_64 | x86_64]
Copyright (C) 2003-2006 Regents of the University of California
Copyright (C) 2007-2023 PyPop team.
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You may redistribute copies of PyPop under the terms of the GNU
General Public License.  For more information about these
matters, see the file named COPYING.

To accept the default in brackets for each filename, simply press
return for each prompt.

Please enter config filename [config.ini]: sample.ini
Please enter population filename [no default]: sample.pop
PyPop is processing sample.pop ...
PyPop run complete!
XML output(s) can be found in: ['sample-out.xml']
Plain text output(s) can be found in: ['sample-out.txt']
Press Enter to continue...

Note

Some messages with the prefix “LOG:” may appear during the console operation. They are informational only and do not indicate improper operation of the program.

In both cases you should substitute the names of your own configuration (e.g., config.ini) and population file (e.g., Guatemalan.pop) for sample.ini and sample.pop (highlighted above). The formats for these files are described in the sections on the data file and configuration file, below.

Batch mode

To run PyPop in the more common “batch mode”, you can run PyPop from the console (as noted above, on Windows: open Command prompt, aka a “DOS shell”; on MacOS or GNU/Linux: open the Terminal application). Change to a directory where your .pop file is located, and type the command:

pypop Guatemalan.pop

Note

If your system administrator has installed PyPop, the script may have been renamed to something different.

Batch mode assumes two things: that you have a file called config.ini in your current folder, and that your population file is also in the current folder; otherwise you will need to supply the full path to the file. You can specify a particular configuration file for PyPop to use by supplying the -c option as follows:

pypop -c newconfig.ini Guatemalan.pop

You may also redirect the output to a different directory (which must already exist) by using the -o option:

pypop -c newconfig.ini -o altdir Guatemalan.pop

Please see pypop usage for the full list of command-line options.
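
If you drive batch runs from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of PyPop itself: build_pypop_command is a hypothetical helper name, and it only uses the -c and -o options described above.

```python
import shlex


def build_pypop_command(popfiles, config=None, outputdir=None):
    """Return the argument list for a batch-mode pypop run.

    Hypothetical helper: only the -c and -o options shown in the
    text above are supported.
    """
    cmd = ["pypop"]
    if config is not None:
        cmd += ["-c", config]
    if outputdir is not None:
        cmd += ["-o", outputdir]
    cmd += list(popfiles)
    return cmd


cmd = build_pypop_command(
    ["Guatemalan.pop"], config="newconfig.ini", outputdir="altdir"
)
# Show the equivalent shell command line.
print(shlex.join(cmd))
# To actually launch the run (requires an installed pypop):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids quoting problems when file names contain spaces.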

What happens when you run PyPop?

The most common types of analysis involve editing your config.ini file to suit your data (see the configuration file), then running PyPop in either the interactive or batch mode described above. If your configuration file is configfilename and your population file is popfilename.txt, the initial output will be generated quickly, but the PyPop run will not be finished until the text output file popfilename-out.txt has been created. A successful run produces two output files: popfilename-out.xml and popfilename-out.txt. A third output file, popfilename-filter.xml, is created if you use the Anthony Nolan HLA filter option to check your input for valid (known) HLA alleles.

The popfilename-out.xml file is the primary output created by PyPop and the human-readable popfilename-out.txt file is a summary of the complete XML output. The XML output can be further transformed into plain text TSV files, either directly via pypop if invoked on multiple input files (using the --enable-tsv option, see pypop usage), or via the popmeta tool that aggregates results from different pypop runs (see Using popmeta to aggregate results).

A typical PyPop run might take anywhere from a few minutes to a few hours, depending on how large your data set is and who else is using the system at the same time. Note that the allPairwiseLDWithPermu test may take several days if you have highly polymorphic loci in your data set.
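
Because a run is only complete once the text output exists, a wrapper script can check for the expected files. This is a sketch under the default naming scheme described above (input name minus suffix, plus -out.xml and -out.txt); expected_outputs and run_finished are hypothetical helper names.

```python
from pathlib import Path


def expected_outputs(popfile, outputdir="."):
    """Return the (XML, text) output paths expected for a population file."""
    stem = Path(popfile).stem  # file name minus its suffix
    out = Path(outputdir)
    return out / f"{stem}-out.xml", out / f"{stem}-out.txt"


def run_finished(popfile, outputdir="."):
    """True once both default output files exist."""
    xml_path, txt_path = expected_outputs(popfile, outputdir)
    return xml_path.exists() and txt_path.exists()


xml_path, txt_path = expected_outputs("Guatemalan.pop", "altdir")
print(xml_path.name, txt_path.name)
```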

2.2. Using popmeta to aggregate results

The popmeta script can aggregate results from a number of output XML files from individual populations into a set of tab-separated (TSV) files containing summary statistics via customized XSLT (eXtensible Stylesheet Language for Transformations) stylesheets. These TSV files can be directly imported into a spreadsheet or statistical software (e.g., R, SAS). In addition, there is some preliminary support for export into other formats, such as those used by population genetic software (e.g., PHYLIP).

Here is an example of a popmeta run, following on from the XML outputs generated in similar fashion in the previous pypop runs:

popmeta -o altdir Guatemalan-out.xml NorthAmerican-out.xml

This will generate a number of .tsv files, in the output directory altdir, of the form 1-locus-allele.tsv, 1-locus-summary.tsv, etc.

You can also supply a prefix to the command-line option --prefix-tsv so that all .tsv files are given a prefix, e.g.,

popmeta -o altdir --prefix-tsv myoutput Guatemalan-out.xml NorthAmerican-out.xml

This will result in files with that prefix, e.g., myoutput-1-locus-allele.tsv.

Note

It’s highly recommended to use the -o option to save the output in a separate subdirectory, as the output .tsv files have fixed names, and will overwrite any files in the local directory with the same name. See popmeta usage for the full list of options.

Note that a similar effect can be achieved directly from a pypop run (assuming that the configuration file can be used for both .pop population files), by invoking pypop with the --enable-tsv option:

pypop -c newconfig.ini -o altdir Guatemalan.pop NorthAmerican.pop --enable-tsv
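
Since the .tsv outputs have fixed names, it can be worth checking an output directory before a run. The sketch below is an illustration only: tsv_files_at_risk is a hypothetical helper, and because the complete list of fixed names is not given here, it conservatively reports every .tsv file (or, when a prefix is given, every .tsv file starting with that prefix).

```python
from pathlib import Path


def tsv_files_at_risk(outputdir, prefix=None):
    """List .tsv files in outputdir that a new run might overwrite.

    Conservative assumption: without a prefix, any existing .tsv file
    is reported; with a prefix, only files named '<prefix>-*.tsv'.
    """
    pattern = f"{prefix}-*.tsv" if prefix else "*.tsv"
    return sorted(p.name for p in Path(outputdir).glob(pattern))


# Example: warn before writing into the current directory.
at_risk = tsv_files_at_risk(".")
if at_risk:
    print("Existing .tsv files that may be overwritten:", at_risk)
```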

2.3. Command-line interfaces

Described below is the usage for both programs, including a full list of the current command-line options and arguments. Note that you can also view this full list of options from the program itself by supplying the --help option, i.e. pypop --help, or popmeta --help, respectively.

pypop usage

usage: pypop [-h] [-o OUTPUTDIR] [-V] [-c CONFIG] [-m] [-d] [-x XSLFILE] [-t]
             [--enable-ihwg] [--enable-phylip] [-p PREFIX_TSV] [-i]
             [-f FILELIST]
             [POPFILE ...]

Options for pypop

-o, --outputdir

put output in directory OUTPUTDIR

-V, --version

show program’s version number and exit

-c, --config

select config file

Default: 'config.ini'

-m, --testmode

run PyPop in test mode for unit testing

-d, --debug

enable debugging output (overrides config file setting)

-x, --xsl

override the default XSLT translation with XSLFILE

TSV output options

Note that --enable-* and --prefix-tsv options are only valid if --enable-tsv/-t is also supplied

-t, --enable-tsv

generate TSV output files (aka run ‘popmeta’)

--enable-ihwg

enable 13th IHWG workshop population data default headers

--enable-phylip

enable generation of PHYLIP .phy files

-p, --prefix-tsv

append PREFIX_TSV to the output TSV files

Mutually exclusive input options

-i, --interactive

run in interactive mode, prompting user for file names

-f, --filelist

file containing list of files (one per line) to process (mutually exclusive with supplying POPFILEs)

POPFILE

input population (.pop) file(s)

Default: []

popmeta usage

usage: popmeta [-h] [-o OUTPUTDIR] [-V] [-p PREFIX_TSV] [--disable-tsv]
               [--output-meta] [-x XSLDIR] [--enable-ihwg]
               [--enable-phylip | -b FACTOR]
               XMLFILE [XMLFILE ...]

Positional Arguments

XMLFILE

XML (.xml) file(s) generated by pypop runs

Default: []

Options for popmeta

-o, --outputdir

put output in directory OUTPUTDIR

-V, --version

show program’s version number and exit

-p, --prefix-tsv

append PREFIX_TSV to the output TSV files

--disable-tsv

disable generation of .tsv TSV files

--output-meta

dump the meta output file to stdout, ignore xslt file

-x, --xsldir

use specified directory to find meta XSLT

--enable-ihwg

enable 13th IHWG workshop population data default headers

Mutually exclusive popmeta options

--enable-phylip

enable generation of PHYLIP .phy files

-b, --batchsize

process in batches of size total/FACTOR rather than all at once, by default do separately (batchsize=0)

Default: 0

2.4. The data file

Sample files

Data can be input either as genotypes, or in an allele count format, depending on the format of your data.

Data files are tab-delimited

These population files are plain text files, such as you might save from the Notepad application on Windows (or Emacs). The columns are tab-delimited, so you can include spaces in your labels. If you have your data in a spreadsheet application, such as Excel or LibreOffice, export the file as tab-delimited text in order to use it as a PyPop data file.

As you will see in the following examples, population files begin with header information. In the simplest case, the first line contains the column headers for the genotype, allele count, or sequence information from the population. If the file contains a population data-block, then the first line consists of headers identifying the data on the second line, and the third line contains the column headers for the genotype or allele count information.

Note that for genotype data, each locus corresponds to two columns in the population file. The locus name must be repeated, with a suffix such as _1, _2 (the default) or _a, _b, and must match the format defined in the config.ini (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.

Listing 2.7 shows the relevant lines for the configuration to read in the data shown in Listing 2.1 and Listing 2.2.

Listing 2.1 Multi-locus allele-level genotype data
a_1	a_2	c_1	c_2	b_1	b_2
****	****	01:02	02:025	13:01	18:012
01:01	02:01	03:07	06:05	14:01	39:021
02:10	03:012	07:12	01:02	15:20	13:01
01:01	02:18	08:04	12:02	35:091	40:05
25:01	02:01	15:07	03:07	51:013	14:01
02:10	32:04	18:01	01:02	78:021	13:01
03:012	32:04	15:07	06:05	51:013	39:021

This is an example of the simplest kind of data file. Note that the columns in the header do not appear to align, but that is due to tab separation. You can copy and paste the data into a text editor to see the tabs.
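
To make the two-columns-per-locus layout concrete, here is a small sketch that reads the first rows of Listing 2.1 and pairs the columns by their _1/_2 suffixes. It is not PyPop's own parser; read_genotypes is a hypothetical helper and the data is embedded as a string for illustration.

```python
import csv
import io

# First two data rows of Listing 2.1, tab-delimited.
LISTING_2_1 = (
    "a_1\ta_2\tc_1\tc_2\tb_1\tb_2\n"
    "****\t****\t01:02\t02:025\t13:01\t18:012\n"
    "01:01\t02:01\t03:07\t06:05\t14:01\t39:021\n"
)


def read_genotypes(text, suffixes=("_1", "_2")):
    """Return {locus: [(allele1, allele2), ...]} from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    first, second = suffixes
    # A locus is any column name ending in the first suffix.
    loci = [n[: -len(first)] for n in reader.fieldnames if n.endswith(first)]
    genotypes = {locus: [] for locus in loci}
    for row in reader:
        for locus in loci:
            genotypes[locus].append((row[locus + first], row[locus + second]))
    return genotypes


genotypes = read_genotypes(LISTING_2_1)
print(genotypes["c"])  # [('01:02', '02:025'), ('03:07', '06:05')]
```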

Listing 2.2 Multi-locus allele-level HLA genotype data with sample information
populat	id	a_1	a_2	c_1	c_2	b_1	b_2
UchiTelle	UT900-23	****	****	01:02	02:025	13:01	18:012
UchiTelle	UT900-24	01:01	02:01	03:07	06:05	14:01	39:021
UchiTelle	UT900-25	02:10	03:012	07:12	01:02	15:20	13:01
UchiTelle	UT900-26	01:01	02:18	08:04	12:02	35:091	40:05
UchiTelle	UT910-01	25:01	02:01	15:07	03:07	51:013	14:01
UchiTelle	UT910-02	02:10	32:04	18:01	01:02	78:021	13:01
UchiTelle	UT910-03	03:012	32:04	15:07	06:05	51:013	39:021

This example shows a data file that has non-allele data in some columns: here, population (populat) and sample identifiers (id).

Listing 2.3 Multi-locus allele-level HLA genotype data with sample and header information
labcode	method	ethnic	contin	collect	latit	longit
USAFEL	12th Workshop SSOP	Telle	NW Asia	Targen Village	41 deg 12 min N	94 deg 7 min E
populat	id	a_1	a_2	c_1	c_2	b_1	b_2
UchiTelle	UT900-23	****	****	01:02	02:025	13:01	18:012
UchiTelle	UT900-24	01:01	02:01	03:07	06:05	14:01	39:021
UchiTelle	UT900-25	02:10	03:012	07:12	01:02	15:20	13:01
UchiTelle	UT900-26	01:01	02:18	08:04	12:02	35:091	40:05
UchiTelle	UT910-01	25:01	02:01	15:07	03:07	51:013	14:01
UchiTelle	UT910-02	02:10	32:04	18:01	01:02	78:021	13:01
UchiTelle	UT910-03	03:012	32:04	15:07	06:05	51:013	39:021

This is an example of a data file which is identical to Listing 2.2, but which includes population level information.

Listing 2.4 Multi-locus allele-level HLA genotype and microsatellite genotype data with header information
labcode	ethnic	complex
USAFEL	****	0
populat	id	drb1_1	drb1_2	dqb1_1	dqb1_2	d6s2222_1	d6s2222_2
UchiTelle	HJK_2	01	03:01	02:01	05:01	249	249
UchiTelle	HJK_1	03:01	03:01	02:01	02:01	249	249
UchiTelle	HJK_3	01	03:01	02:01	05:01	249	249
UchiTelle	HJK_4	01	03:01	02:01	05:01	249	249
UchiTelle	MYU_2	02	04:01	03:02	06:02	247	249
UchiTelle	MYU_1	03:01	03:01	02:01	02:01	247	249
UchiTelle	MYU_3	03:01	04:01	02:01	03:02	249	249
UchiTelle	MYU_4	03:01	04:01	02:01	03:02	247	249

This example mixes different kinds of data: HLA allele data (from DRB1 and DQB1 loci) with microsatellite data (locus D6S2222).

Listing 2.5 Sequence genotype data with header information
labcode	file
BLOGGS	C_New
popName	ID	TGFB1cdn10(1)	TGFB1cdn10(2)	TGFBhapl(1)	TGFBhapl(2)
Urboro	XQ-1	C	T	CG	TG
Urboro	XQ-2	C	C	CG	CG
Urboro	XQ-5	C	T	CG	TG
Urboro	XQ-21	C	T	CG	TG
Urboro	XQ-7	C	T	CG	TG
Urboro	XQ-20	C	T	CG	TG
Urboro	XQ-6	T	T	TG	TG
Urboro	XQ-8	C	T	CG	TG
Urboro	XQ-9	T	T	TG	TG
Urboro	XQ-10	C	T	CG	TG

This example includes nucleotide sequence data: the TGFB1cdn10 locus consists of one nucleotide, while the TGFBhapl locus is actually haplotype data, but PyPop simply treats each combination as a separate “allele” for subsequent analysis.

Listing 2.6 Allele count data
populat	method	ethnic	country	latit	longit
UchiTelle	PCR-SSO	Klingon	QZ	052.81N	100.25E
dqa1	count
01:01	31
01:02	37
01:03	17
02:01	21
03:01	32
04:01	9
05:01	35

PyPop can also process allele count data. However, you cannot mix allele count data and genotype data together in one file.

Note

Currently each .pop file can only contain allele count data for one locus. In order to process multiple loci for one population you must create a separate .pop file for each locus.
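
As an illustration of the allele count layout, the sketch below reads Listing 2.6 and converts the counts to frequencies. It assumes the three-line header shown in the listing (two population header lines, then the locus/count column headers); read_allele_counts is a hypothetical helper, not PyPop code.

```python
# Listing 2.6, tab-delimited, embedded for illustration.
LISTING_2_6 = (
    "populat\tmethod\tethnic\tcountry\tlatit\tlongit\n"
    "UchiTelle\tPCR-SSO\tKlingon\tQZ\t052.81N\t100.25E\n"
    "dqa1\tcount\n"
    "01:01\t31\n"
    "01:02\t37\n"
    "01:03\t17\n"
    "02:01\t21\n"
    "03:01\t32\n"
    "04:01\t9\n"
    "05:01\t35\n"
)


def read_allele_counts(text):
    """Return (locus, {allele: count}) from allele count data."""
    lines = text.splitlines()
    # Lines 1-2: population header block; line 3: locus name and "count".
    locus, _ = lines[2].split("\t")
    counts = {}
    for line in lines[3:]:
        allele, count = line.split("\t")
        counts[allele] = int(count)
    return locus, counts


locus, counts = read_allele_counts(LISTING_2_6)
total = sum(counts.values())
freqs = {allele: count / total for allele, count in counts.items()}
print(locus, total)  # dqa1 182
```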

Missing data

Untyped or missing data may be represented in a variety of ways. The default value for untyped or missing data is a series of four asterisks (****) as specified by the config.ini. You may not “represent” untyped data by leaving a column blank, nor may you represent a homozygote by leaving the second column blank. All cells for which you have data must include data, and all cells for which you do not have data must also be filled in, using a missing data value.

For individuals who were not typed at all loci, the data at the loci for which they are typed will be used in all single-locus analyses for that individual and locus, so the number of individuals (n) may vary from locus to locus in the output. These individuals’ data will also be used for multi-locus analyses; only the loci that contain no missing data will be included in any multi-locus analysis.

If an individual is only partially typed at a locus, it will be treated as if it were completely untyped, and data for that individual for that locus will be dropped from ALL analyses.
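
The rules above can be sketched as follows: an individual counts toward n at a locus only if both alleles are typed, so a partially typed individual is treated as untyped. This is an illustration of the described behavior, not PyPop's implementation; the third individual at locus a is a hypothetical partially typed sample.

```python
UNTYPED = "****"  # default untypedAllele value


def typed_count(pairs, untyped=UNTYPED):
    """Count individuals fully typed at a locus.

    A pair with one or both alleles untyped is dropped, mirroring the
    rule that partial typing is treated as completely untyped.
    """
    return sum(1 for a, b in pairs if a != untyped and b != untyped)


genotypes = {
    # untyped, fully typed, partially typed (hypothetical)
    "a": [("****", "****"), ("01:01", "02:01"), ("02:10", "****")],
    "c": [("01:02", "02:025"), ("03:07", "06:05"), ("07:12", "01:02")],
}

n_per_locus = {locus: typed_count(pairs) for locus, pairs in genotypes.items()}
print(n_per_locus)  # {'a': 1, 'c': 3}
```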

Warning

  • Do not leave trailing blank lines at the end of your data file, as this currently causes PyPop to terminate with an error message that takes experience to diagnose.

  • For haplotype estimation and linkage disequilibrium calculations (i.e., the emhaplofreq part of the program) you are currently restricted to a maximum of seven loci per haplotype request. For haplotype estimation there is a limit of 5000 for the number of individuals (n) [1]
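
A simple pre-flight check can catch trailing blank lines and empty cells before PyPop is run. The following sketch uses a hypothetical helper, find_format_problems, and assumes that a single terminating newline at the end of the last data line is acceptable.

```python
def find_format_problems(text):
    """Return a list of formatting problems found in .pop file text."""
    problems = []
    lines = text.split("\n")
    # Assumption: one final newline terminates the last data line and
    # is not itself a blank line.
    if text.endswith("\n"):
        lines = lines[:-1]
    # Anything blank after the last data line is a trailing blank line.
    while lines and lines[-1].strip() == "":
        problems.append("trailing blank line at end of file")
        lines.pop()
    # Every tab-delimited cell must contain a value (or the missing
    # data marker); blank cells are not allowed.
    for number, line in enumerate(lines, start=1):
        if any(cell.strip() == "" for cell in line.split("\t")):
            problems.append(f"line {number}: empty cell")
    return problems


print(find_format_problems("a_1\ta_2\n01:01\t\n"))  # ['line 2: empty cell']
```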

2.5. The configuration file

The sets of population genetic analyses that are run on your population data file, and the manner in which the data file is interpreted by PyPop, are controlled by a configuration file, the default name for which is config.ini. This is another plain text file consisting of comments (which are lines that start with a semi-colon), sections (which are lines with labels in square brackets), and options (which are lines specifying settings relevant to that section in the option=value format).

Note

If any option runs over one line (such as validSampleFields) then the second and subsequent lines must be indented by exactly one space.

A minimal configuration file

Here we present a minimal .ini file corresponding to Listing 2.1. A section-by-section review of this file follows. (Note that comment lines have been omitted from the example for clarity.) A description of more advanced options is contained in Advanced options.

Listing 2.7 Minimal config.ini file
[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5

Configuration file sections (highlighted above)

  • [General]

    This section contains variables that control the overall behavior of PyPop.

    • debug=0.

      This setting is for debugging. Setting it to 1 will set off a large amount of output of no interest to the general user. It should not be used unless you are running into trouble and need to communicate with the PyPop developers about the problems.

  • Specifying data formats

    There are two possible formats: [ParseGenotypeFile] and [ParseAlleleCountFile]

    [ParseGenotypeFile].

    If your data is genotype data, you will want a section labeled: [ParseGenotypeFile].

    • alleleDesignator.

      This option tells PyPop which fields contain allele data and which do not. You must use this symbol in the validSampleFields option. In general, you won’t need to change it. [Default: * ]

    • untypedAllele.

      This option is used to tell PyPop what symbol you have used in your data files to represent untyped or unknown data fields. These fields MAY NOT BE LEFT BLANK. You must use something consistent that cannot be confused with real data here. [Default: **** ]

  • validSampleFields.

    This option should contain the names of the loci immediately preceding your genotype data (if the file has three header lines, this information will be on the third line; otherwise it will be on the first line of the file). [There is no default; this option must always be present]

    The format is as follows: for each sample field (which may either be an identifying field for the sample, such as populat, or contain allele data), create a new line, where:

    • The first line (validSampleFields=) consists of the name of your sample field (if it contains allele data, the name of the field should be preceded by the character designated in the alleleDesignator option above).

    • All subsequent lines after the first must be preceded by one space (again if it contains allele data, the name of the field should be preceded by the character designated in the alleleDesignator option above).

    Here is an example:

    validSampleFields=*a_1
     *a_2
     *c_1
     *c_2
     *b_1
     *b_2

    Note the initial space at the start of each continuation line.

    Here is an example that includes identifying (non-allele data) information such as sample id (id) and population name (populat):

    validSampleFields=populat
     id
     *a_1
     *a_2
     *c_1
     *c_2
     *b_1
     *b_2
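
To illustrate how the alleleDesignator separates allele fields from identifying fields, here is a small sketch that classifies the entries of a validSampleFields value. It is not PyPop's parser; classify_sample_fields is a hypothetical helper.

```python
def classify_sample_fields(value, allele_designator="*"):
    """Split a validSampleFields value into (allele fields, other fields)."""
    # One field per line; continuation lines carry a leading space.
    fields = [line.strip() for line in value.splitlines() if line.strip()]
    allele_fields = [
        f[len(allele_designator):]
        for f in fields
        if f.startswith(allele_designator)
    ]
    other_fields = [f for f in fields if not f.startswith(allele_designator)]
    return allele_fields, other_fields


# The second example above, written as the raw option value.
value = "populat\n id\n *a_1\n *a_2\n *c_1\n *c_2\n *b_1\n *b_2"
alleles, others = classify_sample_fields(value)
print(alleles)  # ['a_1', 'a_2', 'c_1', 'c_2', 'b_1', 'b_2']
print(others)   # ['populat', 'id']
```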
    

[ParseAlleleCountFile].

If your data is not genotype data, but rather data in the allele-name count format, then you will want to use the [ParseAlleleCountFile] section INSTEAD of the [ParseGenotypeFile] section. The alleleDesignator and untypedAllele options work identically to those described for [ParseGenotypeFile].

  • validSampleFields.

    This option should contain either a single locus name or a colon-separated list of all loci that will be in the data files you intend to analyze using a specific .ini file. The colon-separated list allows you to avoid changing the .ini file when running over a collection of data files containing different loci. e.g.,

    validSampleFields=A:B:C:DQA1:DQB1:DRB1:DPB1:DPA1
     count
    

    Note that each .pop file must contain only one locus (see the note in Listing 2.6). Listing multiple loci simply permits the same .ini file to be reused for each data file.

  • [HardyWeinberg]

    Hardy-Weinberg analysis is enabled by the presence of this section.

    • lumpBelow.

      This option value represents a cut-off value. Alleles with an expected value equal to or less than lumpBelow will be lumped together into a single category for the purpose of calculating the degrees of freedom and overall p-value for the chi-squared Hardy-Weinberg test.

  • [HardyWeinbergGuoThompson]

    When this section is present, an implementation of the Hardy-Weinberg exact test is run using the original Guo and Thompson (1992) code, using a Monte-Carlo Markov chain (MCMC). In addition, two measures (Chen and Diff) of the goodness of fit of individual genotypes are reported under this option (Chen et al., 1999). By default this section is not enabled. This is a different implementation from the Arlequin version listed in Advanced options, below.

    • dememorizationSteps.

      Number of steps to “burn in” the Markov chain before statistics are collected. [Default: 2000 ]

    • samplingNum.

      Number of Markov chain samples [Default: 1000 ].

    • samplingSize.

      Markov chain sample size [Default: 1000 ].

    Note that the total number of steps in the Monte-Carlo Markov chain is the product of samplingNum and samplingSize, so the default values described above would contain 1,000,000 (= 1000 x 1000) steps in the MCMC chain.

    The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

  • [HomozygosityEWSlatkinExact]

    The presence of this section enables Slatkin’s (1994) implementation of the Ewens-Watterson exact test of neutrality.

    • numReplicates.

      The default values have proved to be optimal for us. There is no reason to change them unless you are particularly curious. If you change the default values and have problems, please let us know.

  • [Emhaplofreq]

    The presence of this section enables haplotype frequency estimation and calculation of linkage disequilibrium (LD) measures. Please note that PyPop assumes that the genotype data is unphased when estimating haplotype frequencies and LD measures.

    • lociToEstHaplo.

      In this option you can list the multi-locus haplotypes for which you wish the program to estimate and to calculate the LD. It should be a comma-separated list of colon-joined loci. e.g.,

      lociToEstHaplo=a:b:drb1,a:b:c,drb1:dqa1:dpb1,drb1:dqb1:dpb1
      
    • allPairwiseLD.

      Set this to 1 (one) if you want the program to calculate all pairwise LD for your data, otherwise set this to 0 (zero).

  • allPairwiseLDWithPermu.

    Set this to a positive integer greater than 1 if you need to determine the significance of the pairwise LD measures in the previous section. The number you use is the number of permutations that will be run to ascertain the significance (this should be at least 1000). (Note that this is done via permutation testing performed after the pairwise LD test for all pairs of loci. Note also that this test can take DAYS if your data is highly polymorphic.)

  • numPermuInitCond.

    Set this to change the number of initial conditions used per permutation. [Default: 5 ]. (Note: this parameter is only used if allPairwiseLDWithPermu is set and nonzero).
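
The structure of Listing 2.7 can be explored with Python's standard configparser module. This is a sketch for inspecting a configuration, assuming the file follows the INI conventions that configparser accepts (including one-space continuation lines and semi-colon comments); it makes no claim about how PyPop itself reads the file. parse_loci_to_est_haplo is a hypothetical helper that also applies the seven-locus limit mentioned in the data file section.

```python
import configparser

MINIMAL_INI = """\
[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5
"""

config = configparser.ConfigParser()
config.read_string(MINIMAL_INI)

# The multi-line option comes back as one value with embedded newlines.
fields = config["ParseGenotypeFile"]["validSampleFields"].split()

# Total MCMC steps = samplingNum x samplingSize.
gt = config["HardyWeinbergGuoThompson"]
total_steps = gt.getint("samplingNum") * gt.getint("samplingSize")
print(fields, total_steps)


def parse_loci_to_est_haplo(value, max_loci=7):
    """Split a lociToEstHaplo value into lists of loci, one per haplotype."""
    groups = [group.split(":") for group in value.split(",")]
    for group in groups:
        if len(group) > max_loci:
            raise ValueError(f"more than {max_loci} loci requested: {group}")
    return groups


print(parse_loci_to_est_haplo("a:b:drb1,a:b:c"))
```

Sections act as on/off switches here: an analysis is enabled simply because its section is present, which can be checked with config.has_section.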

Advanced options

The following section describes additional options for previously described sections. Most of the time these options can be omitted and PyPop will choose defaults; however, these advanced options offer greater control over the application. In particular, customization will be required for data that has sample identifiers as in Listing 2.2 or a header data block as in Listing 2.3: both validSampleFields (described above) and validPopFields (described below) will need to be modified.

It also describes two extra sections related to using PyPop in conjunction with Arlequin: [Arlequin] and [HardyWeinbergGuoThompsonArlequin].

[General] advanced options

  • txtOutFilename and xmlOutFilename.

    If you wish to specify a particular name for the output file, which you want to remain identical over several runs, you can set these two items to particular values. The default is to have the program select the output filename, which can be controlled by the next variable. [Default: not used]

  • outFilePrefixType.

    This option can either be omitted entirely (in which case the default will be filename) or be set in several ways. The default is set as filename, which will result in three output files named original-filename-minus-suffix-out.xml, original-filename-minus-suffix-out.txt, and original-filename-minus-suffix-filter.xml. [Default: filename ]

    If you set the value to date instead of filename, you’ll get the date incorporated in the filename as follows: original-filename-minus-suffix-YYYY-nn-dd-HH-MM-SS-out.xml,txt. e.g., USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml (where Y, n, d, H, M, S refer to year, month, day, hour, minute and second, respectively).

  • xslFilename.

    This option specifies where to find the XSLT file to use for transforming PyPop’s xml output into human-readable form. Most users will not normally need to set this option, and the default is the system-installed text.xsl file.
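
The two outFilePrefixType naming schemes described above can be sketched as follows. output_names is a hypothetical helper that reproduces the documented patterns; it is not the code PyPop uses.

```python
from datetime import datetime
from pathlib import Path


def output_names(popfile, prefix_type="filename", now=None):
    """Return the (XML, text) output names for a given prefix type."""
    stem = Path(popfile).stem  # original file name minus suffix
    if prefix_type == "date":
        now = now or datetime.now()
        # Year-month-day-hour-minute-second, as in the example above.
        stem = f"{stem}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    elif prefix_type != "filename":
        raise ValueError(f"unknown outFilePrefixType: {prefix_type}")
    return f"{stem}-out.xml", f"{stem}-out.txt"


stamp = datetime(2003, 9, 21, 1, 29, 35)
print(output_names("USAFEL-UchiTelle.pop", "date", stamp)[0])
# USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml
```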

[ParseGenotypeFile] advanced options

  • fieldPairDesignator.

    This option allows you to override the coding of the headers for each pair of alleles at each locus; it must match the entries in the config file under validSampleFields and the headers in your population data file. If you want to use something other than _1 and _2, change this option. For instance, to use letters and parentheses, set it as follows: fieldPairDesignator=(a):(b) [Default: _1:_2 ]

  • popNameDesignator.

    There is a special designator to mark the population name field, which is usually the first field in the data block. [Default: + ]

    If you are analyzing data that contains a population name for each sample, then the first entry in your validSampleFields section should have a prefixed +, as below:

    validSampleFields=+populat
     *a_1
     *a_2
     ...
    
  • validPopFields.

    If you are analyzing data with an initial two-line population header block, as in Listing 2.3, then you will need to set this option. It should contain the field names in the first line of the header block of your file. [Default: required when a population data-block is present in data file], e.g.:

    validPopFields=labcode
     method
     ethnic
     country
     latit
     longit
    

[Emhaplofreq] advanced options

  • permutationPrintFlag.

    Determines whether the likelihood ratio for each permutation will be logged to the XML output file. This is disabled by default. [Default: 0 (i.e. OFF)].

    Warning

    If this is enabled it can drastically increase the size of the output XML file on the order of the product of the number of possible pairwise comparisons and permutations. Machines with lower RAM and disk space may have difficulty coping with this.

[Arlequin] extra section

This section sets characteristics of the Arlequin application if it has been installed (it must be installed separately from PyPop as we cannot distribute it). The options in this section are only used when a test requiring Arlequin, such as its implementation of Guo and Thompson’s (1992) Hardy-Weinberg exact test, is invoked (see below).

  • arlequinExec.

    This option specifies where to find the Arlequin executable on your system. The default assumes it is on your system path. [Default: arlecore.exe ]

[HardyWeinbergGuoThompsonArlequin] extra section

When this section is present, Arlequin’s implementation of the Hardy-Weinberg exact test is run, using a Monte-Carlo Markov Chain implementation. By default this section is not enabled.

  • markovChainStepsHW.

    Number of steps in the Markov chain [Default: 2500000].

  • markovChainDememorisationStepsHW.

    Number of steps to “burn in” the Markov chain before statistics are collected. [Default: 5000 ]

The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

[Filters] extra section

When this section is present, it allows you to specify successive filters to apply to the data.

  • filtersToApply.

    Here you specify which filters you want applied to the data and the order in which you want them applied. Separate each filter name with a colon (:). Currently there are four predefined filters: AnthonyNolan, Sequence, DigitBinning, and CustomBinning. If you specify one or more of these filters, you will get the default behavior of each filter. To modify the default behavior, add a section with the same name as the specified filter (see the following sections). Please note that, while you are allowed to specify any ordering for the filters, some orderings may not make sense. For example, the ordering Sequence:AnthonyNolan would not make sense (because once the Sequence filter has been applied, your alleles are, as far as PyPop is concerned, amino acid residues). However, the reverse ordering, AnthonyNolan:Sequence, would be logical and perhaps even advisable.
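
The ordering caveat can be checked mechanically. The sketch below splits a filtersToApply value and flags the one problematic ordering named above; parse_filters and ordering_warnings are hypothetical helpers, and no other ordering rules are assumed.

```python
def parse_filters(value):
    """Split a filtersToApply value into an ordered list of filter names."""
    return [name.strip() for name in value.split(":") if name.strip()]


def ordering_warnings(filters):
    """Return warnings for filter orderings that are unlikely to make sense."""
    warnings = []
    if "Sequence" in filters and "AnthonyNolan" in filters:
        if filters.index("Sequence") < filters.index("AnthonyNolan"):
            warnings.append(
                "Sequence runs before AnthonyNolan: allele names will "
                "already have been replaced by sequence residues"
            )
    return warnings


for value in ("AnthonyNolan:Sequence", "Sequence:AnthonyNolan"):
    filters = parse_filters(value)
    print(value, "->", ordering_warnings(filters) or "ok")
```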

[AnthonyNolan] filter section

This section is only useful for HLA data. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your data will be filtered through the Anthony Nolan database of known HLA allele names before processing. The data files this filter relies on are not currently distributed with PyPop but can be obtained via the IMGT FTP site. Invocation of this filter will produce a popfile-filter.xml output file showing what was resolved and what could not be resolved.

  • alleleFileFormat.

    This option specifies which format of the Anthony Nolan allele data will be used. The option can be set to either txt (for the plain free text format) or msf (for the Multiple Sequence Format). [Default: msf ]

  • directory.

    Specifies the path to the root of the sequence files. For txt files: [Default: prefix/share/PyPop/anthonynolan/HIG-seq-pep-text/ ]. For msf files: [Default: prefix/share/PyPop/anthonynolan/msf/ ].

  • preserve-ambiguous.

    The default behavior of the AnthonyNolan filter is to ignore allele ambiguity (“slash”) notation. This notation, common in the literature, looks like: 010101/0102/010301. The default behavior will simply truncate this to 0101. If you want to preserve the notation, set the option to 1. This will result in a filtered allele “name” of 0101/0102/0103 in the above hypothetical example. [Default: 0 ].

  • preserve-unknown.

    The default behavior of the AnthonyNolan filter is to replace unknown alleles with the untypedAllele designator. If you want the filter to keep allele names it does not recognize, set the option to 1. [Default: 0 ].

  • preserve-lowres.

    This option is similar to preserve-unknown, but only applies to low-resolution (“lowres”) alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. However, if the preserve-unknown flag is set, this option has no effect, because all unknown alleles are already preserved. [Default: 0 ].
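
The options above might be combined as in the following sketch. The directory value is a hypothetical placeholder, not a real default, and should be replaced with the location of the downloaded data files on your system.

```ini
[Filters]
filtersToApply=AnthonyNolan

[AnthonyNolan]
alleleFileFormat=msf
; Hypothetical path: point this at your own copy of the msf files
directory=/path/to/anthonynolan/msf/
; Keep "slash" notation such as 0101/0102/0103 instead of truncating it
preserve-ambiguous=1
; Replace unrecognized alleles with the untypedAllele designator (default)
preserve-unknown=0
; Keep allele names shorter than the default allele name length
preserve-lowres=1
```

In this sketch preserve-lowres is meaningful only because preserve-unknown is left at 0; setting preserve-unknown=1 would preserve all unknown alleles and make it redundant.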

[Sequence] filter section

This section allows configuration of the Sequence filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be translated into sequences, and all ensuing analyses will consider each position in the sequence to be a distinct locus. This filter makes use of the same msf format alignment files as used above in the AnthonyNolan filter. It does not work with the txt format alignment files.

  • sequenceFileSuffix.

    Determines the files that will be examined in order to read in a sequence for each allele. For example, if the file for locus A is A_prot.msf, the value would be _prot, whereas if you wanted to use the nucleotide sequence files, you might use _nuc. [Default: _prot ].

  • directory.

    Specifies the path to the root of the sequence files, in the same manner as in the AnthonyNolan section, above.
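
As a sketch, a configuration that uses nucleotide rather than protein alignments might look like the following. The directory value is again a hypothetical placeholder, and the use of _nuc assumes that nucleotide files following that naming pattern are present in that directory.

```ini
[Filters]
filtersToApply=AnthonyNolan:Sequence

[Sequence]
; Read nucleotide alignment files instead of the default _prot files
sequenceFileSuffix=_nuc
; Hypothetical path: point this at your own copy of the msf files
directory=/path/to/anthonynolan/msf/
```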

[DigitBinning] filter section

This section allows configuration of the DigitBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be truncated after the nth digit.

  • binningDigits.

    An integer that specifies how many digits to keep after the truncation. [Default: 4 ].
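
A minimal sketch of a two-digit binning configuration follows; the value 2 is chosen purely for illustration.

```ini
[Filters]
filtersToApply=DigitBinning

[DigitBinning]
; Keep only the first two digits of each allele name
binningDigits=2
```

Going by the description above, an allele name such as 010101 would be expected to be truncated to 01 under this setting.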

[CustomBinning] filter section

This section allows configuration of the CustomBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above.

You can provide a set of custom rules for replacing allele names. Allele names should be separated by / marks. This filter matches any allele names that are exactly the same as the ones you list here, and will also find “close matches” (but only if there are no exact matches). Here is an example:

A=01/02/03
 04/05/03:06
 !06/12:01/13:01
 !07/08:05

In the example above, A*03 alleles will match to 01/02/03, except for A*03:06, which will match to 04/05/03:06. In the output file, A*03:06 will be replaced with 04/05/03:06 and other A*03 alleles will be replaced with 01/02/03. If you place a ! mark in front of the first allele name, that first name will be used as the “new name” for the binned group (for example, A*08:05 will be called 07 in the custom-binned data). Note that the space at the beginning of the lines (following the first line of each locus) is important. The above rules are just dummy examples, provided to illustrate how the filter works. PyPop is distributed with a biologically relevant set of CustomBinning rules that have been compiled from several sources (Cano et al., 2007; Mack et al., 2007) [2].