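Common Biological Data Formats 01: FASTA/FASTQ

We don't produce data; we just move it around.

Before analyzing biological data, it is worth getting familiar with the common data formats; otherwise you may make mistakes you did not see coming.

Common biological data formats include: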

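FASTA is a text format for recording DNA/protein sequences. It consists of two parts: a description line beginning with `>` and the sequence lines that follow it.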
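As a rough illustration of the two-part layout (a description line starting with `>` followed by one or more sequence lines), here is a minimal sketch of a FASTA reader. The function name `read_fasta` and the sample records are made up for this example; real pipelines would normally rely on an established library.

```python
import io


def read_fasta(handle):
    """Yield (description, sequence) pairs from a FASTA-formatted text stream."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            if description is None:
                raise ValueError("sequence data found before any '>' line")
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record input, used only for demonstration.
demo = io.StringIO(">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n")
fasta_records = list(read_fasta(demo))
print(fasta_records)
```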

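FASTQ is a text format for recording sequencing reads together with their quality scores. Each base and each quality score is represented by a single ASCII character.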

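Phred64 encoding: the base quality value is the character's decimal ASCII code minus 64.

Phred33 encoding: the base quality value is the character's decimal ASCII code minus 33.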
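The decoding rule is simple enough to sketch in a few lines: take each character's ASCII code and subtract the offset (33 or 64). The helper name `decode_quality` is an assumption made for this example.

```python
def decode_quality(quality_string, offset=33):
    """Convert an ASCII-encoded quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]


# '!' is ASCII 33, '5' is 53, 'I' is 73.
print(decode_quality("!5I"))             # [0, 20, 40]
# '@' is ASCII 64, 'T' is 84, 'h' is 104.
print(decode_quality("@Th", offset=64))  # [0, 20, 40]
```

Decoding a file with the wrong offset shifts every score by 31, which is why it pays to identify the encoding before filtering on quality.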

A record typically consists of four lines:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
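To make the four-line layout concrete, the sketch below reads records as (identifier, sequence, quality) tuples and checks that the sequence and quality strings have the same length. It assumes every record occupies exactly four lines (no wrapped sequences); `read_fastq` is a name chosen for this example.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield header[1:], sequence, quality


example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
fastq_records = list(read_fastq(io.StringIO(example)))
for identifier, sequence, quality in fastq_records:
    print(identifier, len(sequence), len(quality))
```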

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R	the unique instrument name
6	flowcell lane
73	tile number within the flowcell lane
941	‘x’-coordinate of the cluster within the tile
1973	‘y’-coordinate of the cluster within the tile
#0	index number for a multiplexed sample (0 for no indexing)
/1	the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
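The fields listed above can be pulled apart with a regular expression. This is only a sketch based on the single example identifier shown here; the pattern and the name `parse_old_illumina_id` are assumptions, and identifiers from other instruments may need a looser pattern.

```python
import re

# Pattern derived from the example identifier; treat it as an assumption.
OLD_ILLUMINA_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_old_illumina_id(identifier):
    """Split an older Illumina read identifier into named fields."""
    match = OLD_ILLUMINA_ID.match(identifier)
    if match is None:
        raise ValueError("not an old-style Illumina identifier: " + identifier)
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    return fields


old_fields = parse_old_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(old_fields)
```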

Since version 1.4 of the Illumina pipeline, the multiplex ID uses #NNNNNN instead of #0, where NNNNNN is the sequence of the multiplex tag.

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

EAS139	the unique instrument name
136	the run id
FC706VJ	the flowcell id
2	flowcell lane
2104	tile number within the flowcell lane
15343	‘x’-coordinate of the cluster within the tile
197393	‘y’-coordinate of the cluster within the tile
1	the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y	Y if the read is filtered, N otherwise
18	0 when none of the control bits are on, otherwise it is an even number
ATCACG	index sequence

More recent versions of Illumina software output a sample number in place of the index sequence:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
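Identifiers in this newer layout split cleanly on the space and the colons, so a parser can be sketched without regular expressions. The function name `parse_casava18_id` and the dictionary keys are assumptions for this example; the last field is kept as a string because it may hold either an index sequence or a number.

```python
def parse_casava18_id(identifier):
    """Split a newer-style Illumina read identifier into named fields."""
    left, right = identifier.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = left.split(":")
    pair, filtered, control_bits, index = right.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell_id": flowcell_id,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair": int(pair),
        "filtered": filtered == "Y",  # Y means the read is filtered
        "control_bits": int(control_bits),
        "index": index,               # index sequence or a number, kept as text
    }


with_index = parse_casava18_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
with_number = parse_casava18_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1")
print(with_index["index"], with_number["index"])
```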

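Based on the strength of the fluorescence signal, the sequencer reports an estimated error probability for each base (the probability that the corresponding base call is incorrect, P).

Sanger uses $Q_{sanger}=-10\log_{10}P$, which is also known as the Phred quality score.

Solexa (Illumina Genome Analyzer) uses $Q_{\text{solexa-prior to v.1.3}}=-10\log_{10}\frac{P}{1-P}$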
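A small numerical sketch helps show how the two definitions relate: they nearly coincide when P is small and diverge as P grows. The function names below are chosen for this example.

```python
import math


def phred_quality(p):
    """Sanger/Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Solexa score (before v1.3): Q = -10 * log10(P / (1 - P))."""
    return -10 * math.log10(p / (1 - p))


def phred_to_error(q):
    """Invert the Phred definition: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(phred_quality(p), 2), round(solexa_quality(p), 2))
```

At P = 0.5 the Phred score is about 3 while the Solexa score is 0, and for P above 0.5 the Solexa score becomes negative, which a Phred score never does.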

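The difference between the two is shown in the figure below:

Based on the figure above, one can build a decision tree along these lines:

Contains the character 0:

Does not contain the character 0: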

You can use guess-encoding.py from GitHub to determine the encoding:

gunzip -c file.fastq.gz | awk 'NR % 4 == 0' | head -n 1000000 | python guess-encoding.py
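For intuition only, here is a much simpler heuristic than a full script: look at the lowest ASCII code among the quality characters. It is not the logic of guess-encoding.py; it relies on the commonly cited ranges (Phred+33 starts at ASCII 33, Solexa+64 at 59, Phred+64 at 64), and the name `guess_offset` is made up for this sketch.

```python
def guess_offset(quality_lines):
    """Guess the quality encoding from the lowest character seen (heuristic only)."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        return "Phred+33"   # characters below ';' only occur with offset 33
    if lowest < 64:
        return "Solexa+64"  # ';' to '?' correspond to negative Solexa scores
    return "Phred+64"       # nothing below '@' was seen


print(guess_offset(["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"]))
```

The guess is only as good as the sample: a Phred+33 file whose reads all have very high quality may contain no low characters and be misclassified, so feeding in many reads (as the command above does with one million lines) makes the result more reliable.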