Parse Illumina sample sheets with Python
You can now do:
content = SampleSheet(filename).to_json()
And on the CLI:
❯ sample-sheet to_json paired-end-single-index.csv | jq
{
"Header": {
"IEM1FileVersion": "4",
"Investigator Name": "jdoe",
"Experiment Name": "exp001",
"Date": "11/16/2017",
"Workflow": "SureSelectXT",
"Application": "NextSeq FASTQ Only",
"Assay": "SureSelectXT",
"Description": "A description of this flow cell",
"Chemistry": "Default"
},
"Reads": [
151,
151
],
"Settings": {
"CreateFastqForIndexReads": "1",
"BarcodeMismatches": "2"
},
"Data": [
{
"Sample_Project": "exp001",
"Description": "0.5x treatment",
"Reference_Name": "mm10",
"Sample_Name": "1823A-tissue",
"index": "GAATCTGA",
"Library_ID": "2017-01-20",
"Read_Structure": "151T8B151T",
"Sample_ID": "1823A",
"Target_Set": "Intervals-001"
},
...
]
}
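A minimal sketch of consuming that JSON downstream, using only the standard library. The payload here is an abbreviated stand-in modelled on the output above, not real tool output, and the variable names are illustrative.

```python
import json

# Abbreviated stand-in for the `sample-sheet to_json` output shown above.
payload = """
{
  "Header": {"Investigator Name": "jdoe", "Experiment Name": "exp001"},
  "Reads": [151, 151],
  "Settings": {"CreateFastqForIndexReads": "1", "BarcodeMismatches": "2"},
  "Data": [
    {"Sample_ID": "1823A", "Sample_Name": "1823A-tissue", "index": "GAATCTGA"}
  ]
}
"""

sheet = json.loads(payload)

# Reads are serialized as integers; Header and Settings values stay strings.
read_lengths = sheet["Reads"]
barcode_mismatches = int(sheet["Settings"]["BarcodeMismatches"])

# Map each Sample_ID to its index sequence.
index_by_sample = {row["Sample_ID"]: row["index"] for row in sheet["Data"]}
```

Note that the Settings values arrive as strings, so numeric settings such as BarcodeMismatches need an explicit conversion.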
We now adhere to the entire Illumina specification for sample sheets and support many short-read analysis platform sample sheet variants, including NextSeq, TruSeq, and NovaSeq.
Example of adding a user-defined section thanks to help from @slagelwa:
from sample_sheet import SampleSheet
sample_sheet = SampleSheet()
sample_sheet.add_section('Manifests')
# Add a key value pair!
sample_sheet.Manifests.Key1 = "value1"
The .write() method will write each section out in the order they are defined between the [Reads] and [Settings] sections.
The validation criteria for sample collisions within a sample sheet have been adjusted so that the same Sample_ID, Library_ID, index, and index2 can theoretically appear more than once, as long as each occurrence is in a different Lane.
As requested in #33 by @reisingerf.
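The adjusted rule can be sketched roughly as follows. This is a simplified model, not the library's actual validation code: find_collisions and KEY_FIELDS are hypothetical names, and samples are plain dictionaries rather than Sample objects.

```python
from collections import Counter

# Fields that, together with Lane, identify a sample in this simplified model.
KEY_FIELDS = ("Sample_ID", "Library_ID", "index", "index2")


def find_collisions(samples):
    """Return the (key fields..., Lane) tuples that occur more than once."""
    counts = Counter(
        tuple(sample.get(field) for field in KEY_FIELDS) + (sample.get("Lane"),)
        for sample in samples
    )
    return [key for key, count in counts.items() if count > 1]


base = {
    "Sample_ID": "1823A",
    "Library_ID": "2017-01-20",
    "index": "GAATCTGA",
    "index2": None,
}

# Same sample on two different lanes: allowed.
different_lanes = [dict(base, Lane="1"), dict(base, Lane="2")]

# Same sample twice on the same lane: reported as a collision.
same_lane = [dict(base, Lane="1"), dict(base, Lane="1")]
```

Including Lane in the key is what lets otherwise identical rows coexist; dropping it from the tuple would restore the stricter behaviour.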
Sample sheets will always be written in a deterministic manner, which helps when hashing files to detect changes.
You can perform a round-trip read, modify, write (example below) or create a sample sheet de novo by instantiating the SampleSheet and Sample classes and modifying them directly, as shown in the README.
infile = 'https://raw.githubusercontent.com/clintval/sample-sheet/master/tests/resources/paired-end-single-index.csv'
sample_sheet = SampleSheet(infile)
with open('test.csv', 'w') as handle:
sample_sheet.write(handle)
❯ head <( curl -s https://raw.githubusercontent.com/clintval/sample-sheet/master/tests/resources/paired-end-single-index.csv )
[Header],,,,,,,,
IEM1FileVersion,4,,,,,,,
Investigator Name,jdoe,,,,,,,
Experiment Name,exp001,,,,,,,
Date,11/16/2017,,,,,,,
Workflow,SureSelectXT,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,
Assay,SureSelectXT,,,,,,,
Description,A description of this flow cell,,,,,,,
Chemistry,Default,,,,,,,
❯ head test.csv
[Header],,,,,,,,
IEM1FileVersion,4,,,,,,,
Investigator Name,jdoe,,,,,,,
Experiment Name,exp001,,,,,,,
Date,11/16/2017,,,,,,,
Workflow,SureSelectXT,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,
Assay,SureSelectXT,,,,,,,
Description,A description of this flow cell,,,,,,,
Chemistry,Default,,,,,,,
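Since the two files print identically, one way to confirm a round trip byte for byte is to compare digests. The sketch below uses only the standard library; file_digest is a hypothetical helper, and the two temporary files stand in for the downloaded sheet and test.csv.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Two stand-in files with identical content, as a deterministic writer would produce.
content = "[Header],,,,,,,,\nIEM1FileVersion,4,,,,,,,\n"
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "original.csv")
    rewritten = Path(tmp, "rewritten.csv")
    original.write_text(content)
    rewritten.write_text(content)
    digests_match = file_digest(original) == file_digest(rewritten)
```

A digest comparison only holds if the source file already uses the same padding and line endings the writer emits, so treat a mismatch as a prompt to diff the files rather than as proof of a bug.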
Code test coverage is now calculated on all branches and PRs.
The goal for this project is to sustain at least 95% coverage, with a target of 100%.
❯ ./sample-sheet/run-tests
Name Stmts Miss Cover
---------------------------------------------------
sample_sheet/__init__.py 1 0 100%
sample_sheet/_sample_sheet.py 280 0 100%
---------------------------------------------------
TOTAL 281 0 100%
OK! 58 tests, 0 failures, 0 errors in 0.0s
>>> sample_sheet.experimental_design
"""
| sample_id | sample_name | library_id | description |
|:------------|:--------------|:-------------|:-----------------|
| 1823A | 1823A-tissue | 2017-01-20 | 0.5x treatment |
| 1823B | 1823B-tissue | 2017-01-20 | 0.5x treatment |
❯ sample-sheet-summary paired-end-single-index.csv
┌Header─────────────┬─────────────────────────────────┐
│ iem1_file_version │ 4 │
│ investigator_name │ jdoe │
│ experiment_name │ exp001 │
│ date │ 11/16/2017 │
│ workflow │ SureSelectXT │
│ application │ NextSeq FASTQ Only │
│ assay │ SureSelectXT │
│ description │ A description of this flow cell │
│ chemistry │ Default │
└───────────────────┴─────────────────────────────────┘
...
extras_require now works with:
$ pip install '.[test]'
Features:
Uses smart_open to open a file on S3, HDFS, WebHDFS, or HTTP, as well as local files (compressed or not)
If the Read_Structure column can be inferred, the structure is promoted to class ReadStructure.
Known bugs:
experimental_design() was irreparably broken
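For readers unfamiliar with Read_Structure values such as 151T8B151T, here is a rough sketch of how such a string breaks down into segments. It is not the library's ReadStructure class: parse_read_structure is a hypothetical helper, and the operator set (T template, B sample barcode, M molecular barcode, S skip) is an assumption based on the common read-structure convention.

```python
import re

# One token is a length (or "+" for "rest of the read") followed by an operator letter.
TOKEN = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Split a read structure string into (length, operator) tuples."""
    tokens = TOKEN.findall(structure)
    # Reject strings with characters the token pattern did not consume.
    if "".join(length + op for length, op in tokens) != structure:
        raise ValueError(f"Invalid read structure: {structure!r}")
    return [(None if length == "+" else int(length), op) for length, op in tokens]


tokens = parse_read_structure("151T8B151T")

# Total number of template cycles across both reads.
template_cycles = sum(length for length, op in tokens if op == "T")
```

Here the example string describes two 151-cycle template reads separated by an 8-cycle sample barcode, which matches the paired 151-cycle Reads section shown earlier.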