---
date: '2012-03-30 01:39:39'
layout: post
slug: converting-fastq-to-fasta-with-simple-tools
title: "Converting FASTQ to FASTA with simple tools"
description: ""
category: BioInformatics
tags: [Biology, Galaxy_Original, Linux, Tips, Scripts]
wordpress_id: '1150'
---
{% include JB/setup %}
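
Once again I had FASTQ files that needed to go through BLAST to check for species contamination. I didn't want to write out FASTA files and waste space on the disk array, so I searched online for an on-the-fly method. (Simple commands strung together with pipes, of course...)

Google pointed to the usual place:<br>
http://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk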

Converting FASTQ to FASTA with SED/AWK

I have data that always comes in blocks of four lines in the following format (called FASTQ):

    @SRR018006.2016 GA2:6:1:20:650 length=36
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
    +SRR018006.2016 GA2:6:1:20:650 length=36
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
    @SRR018006.19405469 GA2:6:100:1793:611 length=36
    ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +SRR018006.19405469 GA2:6:100:1793:611 length=36
    7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called FASTA):

    >SRR018006.2016 GA2:6:1:20:650 length=36
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
    >SRR018006.19405469 GA2:6:100:1793:611 length=36
    ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

In principle we want to extract the first two lines in each block-of-4 and replace `@` with `>`.
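
That principle maps almost word for word onto awk's record counter `NR`. The following is only a minimal sketch, assuming every record is exactly four lines (no wrapped sequences); the function name `fq2fa_mod4` is made up for illustration and does not come from the original question or its answers.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes FASTA on stdout.
# Assumes strict four-line records (sequence and quality are not wrapped).
fq2fa_mod4() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header line: swap the leading @ for >
         NR % 4 == 2 { print }                    # sequence line: copy unchanged
        '
}
```

Because only the line position inside each block is inspected, a quality line that happens to begin with `@` is never mistaken for a header.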

* * *

Here is the answer, although Galaxy did not simply go with the one marked as accepted...<br>
This is the fastest I've got, and I stuck it in my .bashrc file:
<pre lang="bash">alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"</pre>
That is:
<pre lang="bash">awk '{print ">" substr($0,2);getline;print;getline;getline}'</pre>
It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).
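
To see the claim about `@` quality lines in action, here is a small sketch that wraps the same awk program in a shell function and feeds it a made-up two-read sample whose first quality line starts with `@`. A function is used instead of the alias only because bash does not expand aliases in non-interactive scripts by default; the read names and sequences are invented for the demonstration.

```shell
# Same awk program as above, wrapped in a function so it also works inside scripts.
# "$@" lets it take file names as arguments or fall back to stdin.
fq2fa() {
    awk '{ print ">" substr($0, 2); getline; print; getline; getline }' "$@"
}

# Made-up sample: the quality line of read1 begins with "@".
printf '@read1\nACGT\n+\n@III\n@read2\nTTGA\n+\n!!!!\n' | fq2fa
```

The `@III` line is swallowed by the last `getline` of the first record, so it never reaches the header rule. For the on-the-fly use case, the function could sit in the middle of a pipeline such as `zcat reads.fq.gz | fq2fa | blastn -db nt -query - -outfmt 6`, assuming a BLAST+ `blastn` that reads its query from standard input when given `-query -`; the file and database names here are placeholders.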

There are also two related programs with more features:

<ul>
        <li><a href="http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage">fastq_to_fasta from FASTX-Toolkit (FASTQ/A short-reads pre-processing tools)</a></li>
        <li><a href="http://prinseq.sourceforge.net/">prinseq (Easy and rapid quality control and data preprocessing)</a></li>
</ul>

Galaxy is offline at the moment; I'll look into these two programs once I have both a connection and some free time.