NGS Lab5 (基因组重测序-SNV calling)
相关软件
安装
( GATK4.6.1.0 的安装可参考我的这篇教程 )
使用 anaconda
创建虚拟环境
conda create -n snv_calling
# 等待一会儿后输入 y , 然后回车进入虚拟环境
conda activate snv_calling安装相关软件
conda install -c bioconda -c conda-forge samtools bcftools igv
# 等待一会儿后输入 y , 然后回车查看软件版本
conda list | grep -E "samtools|bcftools|igv"
# bcftools 1.21
# igv 2.17.4
# samtools 1.21使用 micromamba
创建虚拟环境
# 和 anaconda 不同, micromamba 在新建虚拟环境的时候需要加上 env
micromamba env create -n snv_calling进入虚拟环境
micromamba activate snv_calling安装相关软件
micromamba install -c bioconda -c conda-forge samtools bcftools igv
# 等待一会儿后输入 Y , 然后回车查看软件版本
micromamba list | grep -E "samtools|bcftools|igv"
# bcftools 1.21
# igv 2.17.4
# samtools 1.21任务一
数据:
/public/workspace/shaojf/Course/NGS/DataSets/Lab5/WXS-Case1.subsample.sam需求: (在主目录下下创建myNGS/Lab5,所有分析均放在该目录下)
用
samtools view把.sam文件转成.bam文件,并按坐标排序用
samtools markdup删除PCR重复用
samtools markdup标记PCR重复用
samtools flagstat对三个.bam文件进行简单统计用
samtools addreplacerg添加如下信息:ID:Case1_Run1LB:WXS PL:IlluminaSM:Case1
参考代码
0. 准备工作
新建文件夹并进入
mkdir ~/myNGS/Lab5 cd ~/myNGS/Lab5将
.sam文件软链接到你的文件夹ln -s /public/workspace/shaojf/Course/NGS/DataSets/Lab5/WXS-Case1.subsample.sam .
1. 将 .sam 转 .bam 并按坐标排序
.sam转.bamsamtools view -Sb ./WXS-Case1.subsample.sam -o WXS-Case1.subsample.bam -@ 20按坐标排序
# 若建索引,必须先按坐标排序 samtools sort ./WXS-Case1.subsample.bam -o ./sorted_WXS-Case1.subsample.bam -@ 20
2. 删除PCR重复
添加标签以供 markdup 使用
samtools fixmate -m WXS-Case1.subsample.bam fixmate_WXS-Case1.subsample.bam -@ 20markdup 需要按坐标排序
samtools sort fixmate_WXS-Case1.subsample.bam -o sorted_fixmate_WXS-Case1.subsample.bam -@ 20去除重复项
samtools markdup -r sorted_fixmate_WXS-Case1.subsample.bam dedup_WXS-Case1.subsample.bam -@ 20
3. 标记PCR重复
- 和去除重复项差不多,只是不需要加
-r参数samtools markdup sorted_fixmate_WXS-Case1.subsample.bam markdup_WXS-Case1.subsample.bam -@ 20
4. 简单统计三个 .bam
samtools flagstat sorted_WXS-Case1.subsample.bam > sorted_WXS-Case1.subsample.flagstat
samtools flagstat dedup_WXS-Case1.subsample.bam > dedup_WXS-Case1.subsample.flagstat
samtools flagstat markdup_WXS-Case1.subsample.bam > markdup_WXS-Case1.subsample.flagstat5. 添加信息
samtools addreplacerg -r ID:Case1_Run1 -r LB:WXS -r PL:Illumina -r SM:Case1 \
sorted_WXS-Case1.subsample.bam \
-o RG_sorted_WXS-Case1.subsample.bam -@ 20任务二
- 数据:任务 1 中已完成
markdup的 BAM 文件 - 要求:
- 使用
bcftools mpileup生成 VCF 格式 - 使用
bcftools call识别 SNV - 使用
bcftools filter过滤 SNV,保留满足以下条件的 SNV:QUAL > 10 && DP > 10
- 使用
参考代码
0. 准备工作
查看
.bam文件的历史记录,得知参考基因组及其序列samtools view -h markdup_WXS-Case1.subsample.bam | grep "fa" # 输出为: # @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 60 -M -Y /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa WXS-Case1_R1.fq.gz WXS-Case1_R2.fq.gz # 得知参考基因组位于 /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa软链接参考基因组、索引以及字典文件
ln -s /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa . ln -s /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai . ln -s /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.dict .设置环境变量
- Resource bundle 已经下好了
export GATK_bundle="/public/workspace/shaojf/Course/NGS/Reference/ftp.broadinstitute.org/bundle/hg38"
1. 生成vcf格式
# 输出为: markdup_WXS-Case1.subsample.vcf
bcftools mpileup -f ./Homo_sapiens.GRCh38.dna.primary_assembly.fa -o markdup_WXS-Case1.subsample.vcf markdup_WXS-Case1.subsample.bam --threads 202. 识别SNV
# 输出为: call_markdup_WXS-Case1.subsample.vcf
bcftools call -mv markdup_WXS-Case1.subsample.vcf -o call_markdup_WXS-Case1.subsample.vcf --threads 203. 过滤SNV
# 输出为 filter_WXS-Case1.subsample.vcf
bcftools filter -s LowQual -e 'QUAL>10 && DP>10' call_markdup_WXS-Case1.subsample.vcf -o filter_WXS-Case1.subsample.vcf --threads 20任务三
- 数据: 任务1 中第一步所完成的bam,即sam直接转成的bam文件
- 需求:
gatk AddOrReplaceReadGroups添加如下信息:ID:Case1_Run1 LB:WXS PL:Illumina PU:Run1 SM:Case1
gatk SortSam按坐标排序gatk MarkDuplicates标记PCR重复gatk BaseRecalibrator和gatk ApplyBQSR校正碱基质量gatk HaplotypeCaller识别SNVgatk NVScoreVariants和gatk FilterVariantTranches过滤SNV
参考代码
0. 准备工作
( GATK4.6.1.0 的安装可参考我的这篇教程 )
- 新建文件夹保存输出文件
mkdir gatk1. 添加信息
- 输入文件是
WXS-Case1.subsample.bam
gatk AddOrReplaceReadGroups -I ./WXS-Case1.subsample.bam \
-O ./gatk/RG_WXS-Case1.subsample.bam \
-ID Case1_Run1 -LB WXS -PL Illumina \
-PU Run1 -SM Case12. 按坐标排序
gatk SortSam \
-I RG_WXS-Case1.subsample.bam \
-O ./gatk/sorted_WXS-Case1.subsample.bam \
-SO coordinate3. 标记PCR重复
gatk MarkDuplicates -I ./gatk/sorted_WXS-Case1.subsample.bam \
-O ./gatk/markdup_WXS-Case1.subsample.bam \
-M ./gatk/WXS-Case1.subsample.duplicate_metrics.txt \
--CREATE_INDEX true4. 校正碱基质量
gatk BaseRecalibrator \
-R ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-I ./gatk/markdup_WXS-Case1.subsample.bam \
--known-sites $GATK_bundle/1000G_phase1.snps.high_confidence.hg38.vcf \
--known-sites $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf \
--known-sites $GATK_bundle/dbsnp_146.hg38.vcf \
-O ./gatk/recal_data_WXS-Case1.subsample.table
gatk ApplyBQSR \
--bqsr-recal-file ./gatk/recal_data_WXS-Case1.subsample.table \
-R ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-I ./gatk/markdup_WXS-Case1.subsample.bam \
-O ./gatk/BQSR_WXS-Case1.subsample.bam5. 识别SNV
gatk HaplotypeCaller \
-R ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-I ./gatk/BQSR_WXS-Case1.subsample.bam \
-O ./gatk/WXS-Case1.subsample.HC.vcf.gz6. 过滤SNV
gatk NVScoreVariants \
-V ./gatk/WXS-Case1.subsample.HC.vcf.gz \
-R ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-O ./gatk/WXS-Case1.subsample.annotated.vcf.gz
gatk IndexFeatureFile -I ./gatk/WXS-Case1.subsample.annotated.vcf.gz
gatk FilterVariantTranches \
-V ./gatk/WXS-Case1.subsample.annotated.vcf.gz \
--resource $GATK_bundle/hapmap_3.3.hg38.vcf \
--resource $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf \
--info-key CNN_1D \
--snp-tranche 99.95 \
--indel-tranche 99.4 \
-O ./gatk/WXS-Case1.subsample.annotated.filtered.vcf