I have created a simple set of Python functions that let you simulate genomic sequence and create "mapping" files. You can insert random sequence wherever you wish and generate a set of reads at a particular coverage level. It will only output perfect mappings.

The format of the read file should be pretty self-explanatory. Each read belongs to a pair and each read has a unique id. An id like "ABCDE/1" is paired with the read that has "ABCDE/2". The reads are always oriented so that read 1 is upstream of read 2. Each line of a read file gives the read id, the sequence of the read, the insert length between the paired reads, and the position the read maps to in the reference. If the read does not map, then the position is reported as "None". The insert length may also be reported as "None". This is because we only know the insert length from the mapping positions: if one read from a pair does not map anywhere, then we can't determine the insert length for that pair.
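To make the read file format concrete, here is a rough sketch of how one might parse it. This is not the code from the tarfile. It assumes the four fields appear in the order described above and are separated by whitespace, which you should check against the actual files. The names `Read`, `parse_read_line`, and `pair_reads` are made up for illustration, and the example lines are invented data.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    read_id: str                  # e.g. "ABCDE/1"
    sequence: str                 # the bases of the read
    insert_length: Optional[int]  # None if it could not be determined
    position: Optional[int]       # None if the read does not map


def _int_or_none(field):
    """Turn the literal string "None" into None, anything else into an int."""
    return None if field == "None" else int(field)


def parse_read_line(line):
    """Parse one line, ASSUMING whitespace-separated fields in the order:
    read id, sequence, insert length, position."""
    read_id, sequence, insert_length, position = line.split()
    return Read(read_id, sequence,
                _int_or_none(insert_length), _int_or_none(position))


def pair_reads(reads):
    """Group reads by the part of the id before the "/1" or "/2" suffix."""
    pairs = {}
    for read in reads:
        base, _, mate = read.read_id.rpartition("/")
        pairs.setdefault(base, {})[mate] = read
    return pairs


# Invented example lines, only to show the shape of the data.
example_lines = [
    "ABCDE/1 ACGTACGT 100 15",
    "ABCDE/2 TTGACCAA 100 123",
    "FGHIJ/1 GGGTTTAA None 40",
    "FGHIJ/2 CCCCAAAA None None",
]

reads = [parse_read_line(line) for line in example_lines]
pairs = pair_reads(reads)
```

After this, `pairs["ABCDE"]["1"]` and `pairs["ABCDE"]["2"]` are the two mates of one pair, and any field reported as "None" in the file comes back as Python's `None`.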

Here (cs224ReadSimulation.tar.gz) is a tarfile with all of the code and an example that shows how to do some basic stuff. Note: I fixed some problems, so if you happened to download this before 4.23.09 you should get the new copy.

It's all pretty simple, but I could have missed something. If you find any bugs, please let me know.

The tarfile also includes some example sequences and reads:
refSeq00.fasta — a random reference sequence
refReads00.reads — reads generated with the unchanged reference sequence

insertSeqRef00.fasta — another random reference sequence
insertSeq00.fasta — a sequence in which the above ref (insertSeqRef00) has been given an insert at position 100
insertSeqReads00.reads — reads from the insertSeq00 sequence
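
For anyone curious how files like these could be produced, here is a minimal sketch of the general idea: make a random reference, put an insert at position 100, and sample paired reads at a chosen coverage. This is not the code in the tarfile. The function names, the fixed insert length, the read length, and the coverage formula (total read bases divided by sequence length) are all assumptions for illustration. The positions it reports are in the coordinates of the sequence the reads were drawn from; mapping them back to the unmodified reference is not attempted here.

```python
import random


def random_sequence(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def insert_at(reference, insert, position):
    """Return a copy of reference with insert spliced in at position."""
    return reference[:position] + insert + reference[position:]


def simulate_pairs(sequence, read_length, insert_length, coverage, rng):
    """Sample read pairs with read 1 upstream of read 2.

    ASSUMPTION: coverage = (total bases in reads) / len(sequence), and
    insert_length is the gap between the end of read 1 and the start of read 2.
    """
    span = 2 * read_length + insert_length
    n_pairs = int(coverage * len(sequence) / (2 * read_length))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(sequence) - span + 1)
        mate_start = start + read_length + insert_length
        read1 = (start, sequence[start:start + read_length])
        read2 = (mate_start, sequence[mate_start:mate_start + read_length])
        pairs.append((read1, read2))
    return pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = random_sequence(1000, rng)
donor = insert_at(reference, random_sequence(50, rng), 100)
pairs = simulate_pairs(donor, read_length=25, insert_length=100,
                       coverage=10, rng=rng)
```

Every read here is an exact substring of the sequence it was sampled from, which matches the "perfect mappings only" behavior described above.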

— nick
