Chapter 2. Getting started with PyPop

Last updated: $Date: 2008/11/19 05:14:04 $ by $Author: alancast $

Table of Contents

2.1. Introduction
2.1.1. Interactive mode
2.1.2. Batch mode
2.1.3. What happens when you run PyPop?
2.2. The data file
2.2.1. Sample files
2.2.2. Missing data
2.3. The configuration file
2.3.1. A minimal configuration file
2.3.2. Advanced options

2.1. Introduction

You may use PyPop to analyze many different kinds of data, including allele-level genotype data (as in Example 2.1, “Multi-locus allele-level genotype data”), allele-level frequency data (as in Example 2.6, “Allele count data”), microsatellite data, SNP data, and nucleotide and amino acid sequence data.

There are two ways to run PyPop:

  • interactive mode (where the program will prompt you to directly type the input it needs); and

  • batch mode (where you supply all the command line options the program needs).

For the most straightforward application of PyPop, where you wish to analyze a single population, interactive mode is the simplest to use. We describe this mode first, then batch mode.

2.1.1. Interactive mode

To run PyPop, click the pypop.bat file (Windows) or type ./pypop at the command prompt (GNU/Linux). You should see something like the following output (this is also described in detail in the instructions in the installation guide):

PyPop: Python for Population Genomics (0.4.3)
Copyright (C) 2003 Regents of the University of California
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
You may redistribute copies of PyPop under the terms of the
GNU General Public License.  For more information about these
matters, see the file named COPYING.
To accept the default in brackets for each filename, simply press
return for each prompt.
Please enter config filename [config.ini]: sample.ini
Please enter population filename [no default]: sample.pop
PyPop is processing sample.pop 

(Note: some messages with the prefix "LOG:" may appear here.
They are informational only and do not indicate improper operation 
of the program)

PyPop run complete!
XML output can be found in: sample-out.xml
Plain text output can be found in: sample-out.txt

You should substitute the names of your own configuration file (e.g., config.ini) and population file (e.g., Guatemalan.pop) for sample.ini and sample.pop. The formats of these files are described in Section 2.2, “The data file” and Section 2.3, “The configuration file”, below.

2.1.2. Batch mode

To run PyPop in batch mode, start from the command line (on Windows, open a DOS shell; on GNU/Linux, open a terminal window), change to the directory where you unpacked PyPop, and type

pypop-batch Guatemalan.pop

If your system administrator has installed PyPop, the script may have been given a different name.

Batch mode assumes two things: that you have a file called config.ini in your current folder, and that your population file is also in the current folder. You can specify a particular configuration file for PyPop to use by supplying the -c option as follows:

pypop-batch -c newconfig.ini Guatemalan.pop

You may also redirect the output to a different directory (which must already exist) by using the -o option:

pypop-batch -c newconfig.ini -o altdir Guatemalan.pop
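
If you need to run several populations, the command lines above can be assembled by a short script. The following is a minimal Python sketch and is not part of PyPop: the helper name build_pypop_command is hypothetical, the script name is assumed to be pypop-batch (it may differ on your system), and only the -c and -o options shown above are used. Because the -o directory must already exist, the helper checks for it before building the command.

```python
import os


def build_pypop_command(pop_file, config=None, outputdir=None, script="pypop-batch"):
    """Return the argument list for one batch run (hypothetical helper)."""
    cmd = [script]
    if config is not None:
        # -c selects an alternative configuration file
        cmd += ["-c", config]
    if outputdir is not None:
        # -o redirects output; the directory must already exist
        if not os.path.isdir(outputdir):
            raise ValueError("output directory does not exist: %s" % outputdir)
        cmd += ["-o", outputdir]
    # the population file is always the last argument
    cmd.append(pop_file)
    return cmd


# Same command as: pypop-batch -c newconfig.ini Guatemalan.pop
print(" ".join(build_pypop_command("Guatemalan.pop", config="newconfig.ini")))
```

The returned list can be passed to Python's standard subprocess module (for example, subprocess.call) to launch the run.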

For a full list of options supported by PyPop, type pypop-batch --help. You should receive a screen resembling the following:

Process and run population genetics statistics on an INPUTFILE.
Expects to find a configuration file called 'config.ini' in the
current directory or in /usr/share/PyPop/config.ini.

  -l, --use-libxslt    filter XML via XSLT using libxslt (default)
  -s, --use-4suite     filter XML via XSLT using 4Suite
  -x, --xsl=FILE       use XSLT translation file FILE
  -h, --help           show this message
  -c, --config=FILE    select alternative config file
  -d, --debug          enable debugging output (overrides config file setting)
  -i, --interactive    run in interactive mode, prompting user for file names
  -g, --gui            run GUI (currently disabled)
  -o, --outputdir=DIR  put output in directory DIR
  -V, --version        print version of PyPop
    INPUTFILE   input text file

Documentation for these options is underway, but not currently available.

2.1.3. What happens when you run PyPop?

The most common types of analysis involve editing your config.ini file to suit your data (see Section 2.3, “The configuration file”) and then running PyPop in either interactive or batch mode, as described above. If your configuration file is configfilename and your population file is popfilename.txt, the initial output will be generated quickly, but the PyPop run is not finished until the text output file popfilename-out.txt has been created. A successful run produces two output files: popfilename-out.xml and popfilename-out.txt. A third output file, popfilename-filter.xml, is created if you use the Anthony Nolan HLA filter option to check your HLA data for valid/known HLA alleles.
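
The naming pattern of the output files can be sketched in Python as follows. This is an illustration only, not PyPop's own code: the function names are hypothetical, and the sketch assumes that the extension of the population file is dropped before the suffix is added (as with sample.pop and sample-out.xml in the interactive example) and that the files are written to the current directory unless an output directory is given.

```python
import os


def expected_outputs(pop_file, outputdir=None, hla_filter=False):
    """List the output file names expected for pop_file (hypothetical helper)."""
    # popfilename.txt -> popfilename
    stem = os.path.splitext(os.path.basename(pop_file))[0]
    names = [stem + "-out.xml", stem + "-out.txt"]
    if hla_filter:
        # only produced when the Anthony Nolan HLA filter option is used
        names.append(stem + "-filter.xml")
    if outputdir:
        names = [os.path.join(outputdir, name) for name in names]
    return names


def text_output_exists(pop_file, outputdir=None):
    """Check for the text output file, which is created last in a run."""
    return os.path.exists(expected_outputs(pop_file, outputdir)[1])


print(expected_outputs("popfilename.txt", hla_filter=True))
```

Note that the presence of the text file only shows that it has been created; this sketch does not verify its contents.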

The popfilename-out.xml file is the primary output created by PyPop and the human-readable popfilename-out.txt file is a summary of the complete XML output. It is generated from the XML output via XSLT (eXtensible Stylesheet Language for Transformations) using the default XSLT stylesheet text.xsl, which is located in the xslt directory. The XML output can be further transformed using customized XSLT stylesheets into other formats for input to statistical software (e.g., R/Splus, SAS) or other population genetic software (e.g., PHYLIP). The popmeta script (popmeta.bat on Windows, popmeta on GNU/Linux) calls on other XSLT stylesheets to aggregate results from a number of output XML files from individual populations into a set of tab-separated (TSV) files containing summary statistics. These TSV files can be directly imported into a spreadsheet or statistical software. This script will be further documented in the next release.
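
Besides a spreadsheet, the TSV files can be read with a few lines of Python. The sketch below uses only the standard csv module; the helper name read_tsv is hypothetical, and the sample text is a made-up stand-in, since the actual column names written by popmeta are not documented in this chapter and will differ.

```python
import csv
import io


def read_tsv(handle):
    """Return the rows of a tab-separated file as dicts keyed by the header line."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up stand-in for a popmeta TSV file; real column names will differ.
example = io.StringIO("pop\tlocus\tvalue\nPopA\tLocusX\t0.5\n")
rows = read_tsv(example)
print(rows[0]["locus"])
```

To read a real file, open it in text mode and pass the file object to read_tsv in place of the in-memory example. All values are returned as strings and must be converted to numbers where needed.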

A typical PyPop run might take anywhere from a few minutes to a few hours, depending on the size of your data set and on who else is using the system at the same time. Note that the allPairwiseLDWithPermu test may take several days if your data set contains highly polymorphic loci.