gsalib tutorial

Introduction

This tutorial is adapted from "What is the GATKReport file format?" by Geraldine Van der Auwera.

A GATK Report is a plain-text document that presents tabular data in a well-formatted, easy-to-read layout. Many GATK tools output their results as GATK Reports. A report contains one or more individual GATK report tables.

Here’s a simple example (note that the format varies depending on the report version):

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7
0      7.451835696110506E-3      25.474613284804366
1      2.362777171937477E-3      29.844949954504095
2      9.087604507451836E-4      32.875909752547310
3      5.452562704471102E-4      34.498999090081895
4      9.087604507451836E-4      35.148316651501370
5      5.452562704471102E-4      36.072234352256190
6      5.452562704471102E-4      36.121724890829700
7      5.452562704471102E-4      36.191048034934500
8      5.452562704471102E-4      36.003457059679770

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T
1:1001  A
1:1002  C

This report contains two individual GATK report tables. A report file begins with a report header that contains the report version and, in later versions, the number of tables. Every table begins with two header lines: one for its metadata and one for its name and description. The next row contains the column names, and the data rows follow.
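
To make the layout concrete, here is a minimal sketch that splits the text of a v1.0 report like the one above into its tables, using only the Python standard library. The function name parse_gatk_report and the dictionary keys it returns are hypothetical, and this is not how gsalib itself is implemented. The sketch ignores the metadata line of each table and assumes that no cell contains whitespace.

```python
# Shortened copy of the example report (first table trimmed to two rows).
SAMPLE = """\
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7
0      7.451835696110506E-3      25.474613284804366
1      2.362777171937477E-3      29.844949954504095

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T
1:1001  A
1:1002  C
"""


def parse_gatk_report(text):
    """Return (version, table_count, tables) for a v1.0-style report.

    Hypothetical helper for illustration only; not part of gsalib.
    """
    lines = text.splitlines()

    # Report header, e.g. "#:GATKReport.v1.0:2".
    parts = lines[0].split(":")
    version = parts[1].split(".", 1)[1]
    # Earlier versions may omit the table count.
    table_count = int(parts[2]) if len(parts) > 2 else None

    tables = []
    i = 1
    while i < len(lines):
        if not lines[i].startswith("#:GATKTable:"):
            i += 1
            continue
        # Line i is the metadata header (ignored here);
        # line i + 1 holds the table name and description.
        name, description = lines[i + 1].split(":", 3)[2:4]
        columns = lines[i + 2].split()
        i += 3
        rows = []
        # Data rows run until a blank line or the end of the text.
        while i < len(lines) and lines[i].strip():
            rows.append(lines[i].split())
            i += 1
        tables.append({"name": name, "description": description,
                       "columns": columns, "rows": rows})
    return version, table_count, tables


version, n_tables, tables = parse_gatk_report(SAMPLE)
for table in tables:
    print(table["name"], table["columns"], len(table["rows"]))
```

All values are kept as strings here; gsalib hands you pandas DataFrames instead, which is usually what you want for analysis.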

The Python module gsalib allows you to load GATK Report files into Python/pandas DataFrames for further analysis. Here are the simple steps to get gsalib, install it, and load a report.

1. Get the gsalib module from PyPI

Install gsalib by running the following on the command line:

pip install gsalib

2. Start Python (or open a Python notebook in Jupyter)

$ python3
Python 3.6.4 (default, Dec 19 2017, 11:33:49)
[GCC 6.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

3. Import the GatkReport class from gsalib

>>> from gsalib import GatkReport

4. Finally, load the GATKReport file and have fun

gsalib has one class, GatkReport, which is a dict-like container for all of the tables in a GATK Report file. The GatkReport.tables attribute is a mapping in which each key is a table name and each value is a pandas DataFrame that contains the table’s data. Note that if a report contains more than one table with the same name, the keys are made unique as table, table.1, etc.:

>>> d = GatkReport('/path/to/gsalib/test/test_v1.0_gatkreport.table')
>>> for table in d.tables:
...     d[table].describe()


          cycle  errorrate.61PA8.7  qualavg.61PA8.7
count  9.000000           9.000000         9.000000
mean   4.000000           0.001595        33.581250
std    2.738613           0.002273         3.687989
min    0.000000           0.000545        25.474613
25%    2.000000           0.000545        32.875910
50%    4.000000           0.000545        35.148317
75%    6.000000           0.000909        36.072234
max    8.000000           0.007452        36.191048
           key column
count        3      3
unique       3      3
top     1:1000      A
freq         1      1
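
The naming scheme for duplicate table names mentioned above (table, table.1, etc.) can be sketched as follows. The helper uniquify is hypothetical and only illustrates the scheme; gsalib's own implementation may differ, and the sketch does not guard against a report that already contains a table literally named table.1.

```python
def uniquify(names):
    """Append .1, .2, ... to repeated names, leaving the first one as is.

    Hypothetical helper for illustration only; not part of gsalib.
    """
    seen = {}      # name -> number of times it has appeared so far
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return unique


keys = uniquify(["table", "table", "other", "table"])
print(keys)
```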

For more examples, see gsalib/examples, which contains:

reshape_concordance_table
    Given a GATK Report generated by GATK GenotypeConcordance, this function reshapes the concordance table for a specified sample into a matrix with the EvalGenotypes in rows and the CompGenotypes in columns.
summarize_varianteval (Python 3 only)
    Summarizes several tables produced by GATK VariantEval into a VariantEvalMetricsSummary table, as described in "(howto) Evaluate a callset with VariantEval".
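
To illustrate the kind of reshaping that reshape_concordance_table performs, here is a minimal sketch of a pivot from per-pair counts into a matrix with eval genotypes in rows and comp genotypes in columns. Everything in it is an assumption made for illustration: the function to_matrix, the genotype labels, the input layout (a dict keyed by (eval, comp) pairs), and the counts, which are made up. It does not reproduce the column layout of a real GenotypeConcordance report or the code in gsalib/examples.

```python
def to_matrix(counts, genotypes):
    """Pivot {(eval, comp): count} into rows (eval) by columns (comp).

    Hypothetical helper for illustration only; pairs that are missing
    from counts are filled with 0.
    """
    return [[counts.get((ev, comp), 0) for comp in genotypes]
            for ev in genotypes]


# Made-up labels and counts, purely for demonstration.
GENOTYPES = ["HOM_REF", "HET", "HOM_VAR"]
counts = {
    ("HOM_REF", "HOM_REF"): 90,
    ("HET", "HET"): 40,
    ("HET", "HOM_VAR"): 3,
    ("HOM_VAR", "HOM_VAR"): 20,
}

matrix = to_matrix(counts, GENOTYPES)
for label, row in zip(GENOTYPES, matrix):
    print(label, row)
```

With pandas available, the same reshaping is typically expressed as a pivot on a long-format DataFrame rather than with nested lists.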