1. Read FASTA file:
(1) Method 1: Biopython (the Bio package)
from Bio import SeqIO
fa_seq = SeqIO.read("res/sequence1.fasta", "fasta")
seq = str(fa_seq.seq)
seqs = [fa.seq for fa in SeqIO.parse("res/multi.fasta", "fasta")]
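(2) Method 2: pysam library
import pysam
fa = pysam.FastaFile('filename')  # random access through a faidx (.fai) index
fa.references  # sequence names
fa.lengths  # sequence lengths, in the same order as fa.references
seq = fa.fetch(reference='chr21', start=a, end=b)  # subsequence, 0-based half-open [a, b)
seq = fa.fetch(reference='chr21')  # whole sequence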
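The start/end arguments of fetch follow the 0-based, half-open convention, the same as Python slicing. The sketch below illustrates this with the standard library only; parse_fasta and fetch are hypothetical helpers written for this note, not pysam functions, and the toy record is made up. Unlike pysam, it loads everything into memory instead of using an index.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def fetch(records, reference, start=None, end=None):
    """Return the bases in [start, end) of one sequence (0-based, half-open)."""
    return records[reference][start:end]


# Toy two-record FASTA, invented for illustration.
example = ">chr21 toy record\nACGTAC\nGTTT\n>chrM\nGGCC\n"
records = parse_fasta(example)
print(list(records))                  # sequence names
print(fetch(records, "chr21", 2, 6))  # bases at 0-based positions 2, 3, 4, 5
print(fetch(records, "chr21"))        # whole sequence
```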
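2. Convert sequence to string: str(fa_seq.seq)
# fa_seq is the Biopython record read with SeqIO.read in section 1
print("G Counts: ", fa_seq.seq.count("G"))
print("reverse: ", fa_seq.seq[::-1])
print("Reverse complement: ", fa_seq.seq.reverse_complement())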
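To make explicit what these operations compute, here is a dependency-free sketch on a plain Python string. COMPLEMENT and reverse_complement are names chosen for this illustration (not Biopython's API), the input sequence is made up, and only unambiguous A/C/G/T bases are handled.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "ATGCGG"
print("G Counts: ", seq.count("G"))
print("reverse: ", seq[::-1])
print("Complement: ", seq.translate(COMPLEMENT))
print("Reverse complement: ", reverse_complement(seq))
```

Note that the complement alone and the reverse complement differ: the second is the first read backwards.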
3. Error when reading a BAM file with pysam:
File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 952, in pysam.libcalignmentfile.AlignmentFile._open
File "pysam/libchtslib.pyx", line 365, in pysam.libchtslib.HTSFile.check_truncation
OSError: no BGZF EOF marker; file may be truncated
The file can still be opened by passing ignore_truncation=True:
bam = pysam.AlignmentFile('**.bam', "rb", ignore_truncation=True)
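The error refers to the 28-byte empty BGZF block that the SAM/BAM specification places at the end of a complete file. Before ignoring the check, it can be worth confirming whether the marker really is missing, since the file may be incomplete. The sketch below uses only the standard library; has_bgzf_eof is a hypothetical helper, and the two temporary files are toy byte strings, not valid BAM files.

```python
import os
import tempfile

# The empty BGZF block that marks the end of a complete BGZF/BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path):
    """Return True if the last 28 bytes of the file are the BGZF EOF block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


# Toy demonstration: one file ends with the marker, the other is cut short.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.bin")
    truncated = os.path.join(tmp, "truncated.bin")
    with open(complete, "wb") as handle:
        handle.write(b"payload" + BGZF_EOF)
    with open(truncated, "wb") as handle:
        handle.write(b"payload" + BGZF_EOF[:-5])
    complete_ok = has_bgzf_eof(complete)
    truncated_ok = has_bgzf_eof(truncated)

print(complete_ok, truncated_ok)
```

If the marker is missing, reads near the end of the file may be lost, so re-downloading or regenerating the file is usually safer than ignoring the truncation.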
4. CRAM format: CRAM is highly compressed and achieves a higher compression ratio than BAM, so most files may be stored in CRAM format in the future.
5. Alignment data: BAM/CRAM/SAM; variant data: VCF/BCF
6. Get each read
for read in bam:
    read.reference_name  # name of the reference sequence (chromosome) the read is aligned to
    read.pos  # 0-based leftmost alignment position (older alias of read.reference_start)
    read.mapq  # mapping quality of the alignment (older alias of read.mapping_quality)
    read.query_qualities  # base qualities of the read sequence
    read.query_sequence  # bases of the read sequence
    read.reference_length  # length of the read's alignment on the reference genome
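The alignment length on the reference follows from the read's CIGAR string: in the SAM specification only the M, D, N, = and X operations consume reference bases. The sketch below computes it with the standard library; reference_length here is a hypothetical standalone function named after the pysam attribute, and the CIGAR string is a made-up example.

```python
import re

# CIGAR operations that advance the position on the reference (SAM specification).
REFERENCE_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar):
    """Sum the lengths of all reference-consuming operations in a CIGAR string."""
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REFERENCE_CONSUMING
    )


cigar = "5S20M2I10M3D8M"  # soft clip, match, insertion, match, deletion, match
aligned = reference_length(cigar)
print(aligned)  # 20 + 10 + 3 + 8 = 41
```

Soft clips (S) and insertions (I) belong to the read but not to the reference, so they do not count.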
7. Read CRAM file
cf = pysam.AlignmentFile('*.cram', 'rc')
8. Read SAM file
samfile = pysam.AlignmentFile('**.sam', 'r')
9. Get a region in a BAM file
for r in pysam.AlignmentFile('*.bam', 'rb').fetch('chr21', 300, 310):
    pass  # this only works if the *.bam file is indexed
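The region passed to fetch is a 0-based, half-open interval, and fetch yields every read that overlaps it, not only reads contained in it. The sketch below shows that overlap test on plain tuples using the standard library only; overlaps and the read coordinates are invented for illustration and are not part of pysam.

```python
def overlaps(read_start, read_end, region_start, region_end):
    """True if half-open [read_start, read_end) intersects [region_start, region_end)."""
    return read_start < region_end and read_end > region_start


# (name, reference_start, reference_end) for four made-up reads on one chromosome.
reads = [
    ("r1", 280, 300),  # ends exactly where the region starts: no overlap
    ("r2", 295, 305),  # spans the region start
    ("r3", 309, 330),  # covers only the last base of the region
    ("r4", 310, 340),  # starts exactly at the region end: no overlap
]

hits = [name for name, start, end in reads if overlaps(start, end, 300, 310)]
print(hits)  # ['r2', 'r3']
```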