作者,Evil Genius
我们来分享一下关于WES分析部分的内容,其中内容主要包括三部分
-
基础部分,包括WES数据的call SNP、INDEL、MSI、fusion、annovar、CNV等部分,我们不仅要会每个部分,也要构建全自动化脚本,实现一键式分析出结果,包括甄别配对样本和单肿瘤样本。
-
高级注释部分,基础部分当然annovar已经做了很多的注释了,但是仍然还差得远,还需要更多的数据库注释,其中最重要的就是OncoKB的自动化注释,还有有很多人工添加的用药位点,同样的道理,要实现一键式自动化出分析结果的内容,其中还有计算TMB等内容。
-
临检报告的出具,这部分是WES核心内容,分析数据的解读以及如何根据分析得到的数据进行临床用药指导,同样,我们需要代码帮我们实现一键式出结果,对于敏感位点要报出warning,人工审核。
实际过程中接收到的样本往往只有tumor样本,其中血液样本和组织样本的过滤条件稍有不同。
这一篇我们来实现第一部分,其中分析用到的软件大家可以自行查阅。我们从拿到数据开始。
第一步:数据质控
普通质控
fastp=fastp软件路径
fq1=tumor外显子read1
fq2=tumor外显子read2
$fastp -i fq1 -I fq2 -o tumor_clean.R1.fq.gz" -O \
tumor_clean.R2.fq.gz" \
-h tumor_fastp.html \
-j tumor_fastp.json -3 -5
实际分析过程则需要更多的过滤条件
fastp=fastp软件路径
$fastp -i normal_1.fastq.gz -o normal_1.depumi.fastq.gz \
-I normal_2.fastq.gz -O normal_2.depumi.fastq.gz \
-U --umi_loc=per_read --umi_len=5 --umi_skip=4 --umi_prefix=UMI \
-Q -L -A -3 -5 -h normal.html -j fastp/normal.json
$fastp -i tumor_1.fastq.gz -o tumor_1.depumi.fastq.gz -I tumor_2.fastq.gz \
-O tumor_2.depumi.fastq.gz \
-U --umi_loc=per_read --umi_len=5 --umi_skip=4 --umi_prefix=UMI \
-Q -L -A -3 -5 -h tumor.html -j tumor.json
第二步,数据比对
bwa=bwa路径
samtools=samtools软件路径
$bwa mem -t 12 ucsc.hg19.fasta参考基因组路径 \
tumor_clean.R1.fq.gz tumor_clean.R2.fq.gz \
-R "@RG\tID:tumor\tLB:tumor\tPL:ILLUMINA\tSM:tumor\tPU:ILLUMINA" \
| $samtools view -bS - -o tumor.raw.bam
这里面首先需要构建好参考基因组,通常是hg19的,还有就是-R的参数设置。示例是针对tumor样本,对于normal也一样的操作。-t参数根据服务器性能自行调整。
第三步,bam文件的sort和去重
gatk=gatk软件路径,推荐最新版倒退一两个版本
$samtools sort -@ 12 -m 10G tumor.raw.bam -o tumor.sort.bam
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=临时路径" MarkDuplicates \
-I tumor.sort.bam \
-O tumor.markdup.bam \
-M tumor.markdup_metrics.txt
第四步,BQSR,其中bed文件一般试剂盒会提供
target=bed文件路径
bedtools=bedtools软件路径
DBSNP=dbSNP数据库路径
MILLS=千人计划基础SNP位点
INDELS=千人计划基础indel位点
cat $target | awk -v OFS="\t" -v gap=500 '{$2=$2-gap;$3=$3+gap;print $0}' | $bedtools merge -i - > ./targetplus_bed
$gatk BaseRecalibrator \
-R 参考基因组路径 \
-I tumor.markdup.bam \
-L ./targetplus_bed\
--known-sites $DBSNP \
--known-sites $MILLS \
--known-sites $INDELS \
-O tumor.recal_data
$gatk ApplyBQSR \
--bqsr-recal-file tumor.recal_data \
-L ./targetplus_bed\
-R 参考基因组路径 \
-I tumor.markdup.bam \
-O tumor.markdup.BQSR.bam
rm ./targetplus_bed -rf
需要注意的是分析必须配备bed文件,否则后续的数据解读会出问题。
如果有normal样本需要再来一次上述分析。
第五步,HaplotypeCaller,其中参数的设置需要大家研究一下
有normal样本
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=临时路径" HaplotypeCaller \
-R 参考基因组路径 \
-I normal.markdup.BQSR.bam \
--output normal.raw.vcf \
-L $target \
-stand-call-conf 30.0 \
--max-mnp-distance 5 \
-RF GoodCigarReadFilter \
--dont-use-soft-clipped-bases \
--minimum-mapping-quality 20 \
--native-pair-hmm-threads 8
只有tumor样本
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=临时路径" HaplotypeCaller \
-R 参考基因组路径 \
-I tumor.markdup.BQSR.bam \
--output tumor.raw.vcf \
-L $target \
-stand-call-conf 30.0 \
--max-mnp-distance 5 \
-RF GoodCigarReadFilter \
--dont-use-soft-clipped-bases \
--minimum-mapping-quality 20 \
--native-pair-hmm-threads 8
filter,这一步的参数选择前面已经介绍过,大家需要思考的是为什么只有tumor的时候也需要对tumor数据进行HaplotypeCaller分析
$gatk SelectVariants \
--select-type-to-exclude INDEL \
-V tumor.raw.vcf \
-O tumor.snp.vcf
$gatk SelectVariants \
--select-type-to-include INDEL \
-V tumor.raw.vcf \
-O tumor.indel.vcf
$gatk VariantFiltration \
-R 参考基因组路径 \
--filter-expression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0" \
--filter-name "Standard" \
--filter-expression "DP <= 5" \
--filter-name "DP_5" \
--filter-expression "DP < 20" \
--filter-name "DP_20" \
-V tumor.indel.vcf \
-O tumor.indel.filter.vcf
$gatk VariantFiltration \
-R 参考基因组路径 \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0" \
--filter-name "Standard" \
--filter-expression "DP <= 5" \
--filter-name "DP_5" \
--filter-expression "DP < 20" \
--filter-name "DP_20" \
-V tumor.snp.vcf \
-O tumor.snp.filter.vcf
第六步,mutect,af_only_gnomad这个文件是什么大家自行查阅,这里就不过多介绍了。
在使用GATK4 Mutect2软件分析候选体细胞变异时,最好使用Panel of Normals (PON)文件,以提高变异分析准确度。
GATK开发团队认为,在生成PON文件时,正常对照样本越多越好(至少40个),但是其实一些小样本(4-10个)也是可以的,有总比没有好。
only tumor
af_only_gnomad=af_only_gnomad路径
$gatk --java-options "-Xmx5g -Xms5g -Djava.io.tmpdir=临时路径" Mutect2 \
-R 参考基因组路径 \
-I tumor.markdup.BQSR.bam \
-tumor tumor \
-L $target \
--output tumor.raw.vcf \
--germline-resource $af_only_gnomad \
--af-of-alleles-not-in-resource 0.00003125 \
--f1r2-tar-gz tumor.f1r2.tar.gz \
--min-base-quality-score 20 \
--max-mnp-distance 5 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--native-pair-hmm-threads 8 \
-G StandardMutectAnnotation \
--dont-use-soft-clipped-bases
$gatk LearnReadOrientationModel \
-I tumor.f1r2.tar.gz \
-O tumor.read-orientation-model.tar.gz
$gatk GetPileupSummaries \
-I tumor.markdup.BQSR.bam \
-L $small_exac_common \
-V $small_exac_common \
-O tumor.getpileupsummaries.table
$gatk CalculateContamination \
-I tumor.getpileupsummaries.table \
-O tumor_calculatecontamination.table
$gatk FilterMutectCalls \
-V tumor.raw.vcf \
-R 参考基因组路径 \
--ob-priors tumor.read-orientation-model.tar.gz \
-contamination-table tumor_calculatecontamination.table \
-O tumor.raw.filtered.vcf
有配对样本
$gatk --java-options "-Xmx5g -Xms5g -Djava.io.tmpdir=临时路径" Mutect2 \
-R 参考基因组路径 \
-I tumor.markdup.BQSR.bam \
-tumor tumor \
-I normal.markdup.BQSR.bam \
-normal normal \
-L $target \
--output tumor.raw.vcf \
--germline-resource $af_only_gnomad \
--af-of-alleles-not-in-resource 0.00003125 \
--f1r2-tar-gz tumor.f1r2.tar.gz \
--min-base-quality-score 20 \
--max-mnp-distance 5 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--native-pair-hmm-threads 8 \
-G StandardMutectAnnotation \
--dont-use-soft-clipped-bases \
-RF NotSupplementaryAlignmentReadFilter
后面的filter过程跟第六步就是一致的。
当然这样的清洗还是不够,还需要对数据进行分析调整
Mutect2_Filter=python的mutect的清洗脚本
python $Mutect2_Filter tumor.raw.filtered.vcf
第七步,annovar,数据库filter和注释,需要提前下载好注释的数据库文件,注意这里对HaplotypeCaller和mutect的分析结果均要进行该操作,mutect文件就是tumor.raw.filtered.vcf,下面是示例。
humandb=数据库路径,就在annovar软件下面,常见的写在了protocol参数下面
table_annovar=annovar软件注释路径
perl $table_annovar \
--buildver hg19 \
--otherinfo \
--nastring . tumor.snp.filter.vcf $humandb \
-protocol refGene,dbnsfp31a_interpro,avsnp150,snp138,1000g2015aug_all,exac03,clinvar_20210501,intervar_20180118,cosmic95,dbnsfp30a \
-operation g,f,f,f,f,f,f,f,f,f \
--vcfinput --remove
分析得到的文件示例如下:

第八步,MSI
msisensor2=msisensor2软件路径
msisensor_pro=msisensor_pro软件路径
microsatellites=微卫星位点的基线文件(软件msisensor2)
microsatellites_pro=微卫星位点的基线文件(软件msisensor_pro)
modelhg19=models_hg19_GRCh37
####有normal样本
$msisensor msi \
-d $microsatellites \
-n normal.BQSR.bam \
-t tumor.BQSR.bam \
-o tumor.MSI
####only tumor
$msisensor_pro pro -d $microsatellites_pro -t tumor.markdup.BQSR.bam -o tumor.MSI
第九步,Fusion,这里大家要考虑为什么用factera的时候需要对数据重新比对。
factera=factera.pl路径
ref_bit=ucsc.hg19.2bit
exons_bed=外显子的bed文件
#####重新比对
$bwa mem -Y -t 8 参考基因组路径 tumor_clean.R1.fq.gz tumor_clean.R2.fq.gz -R "@RG\tID:$prefix_t\tLB:tumor\tPL:ILLUMINA\tSM:tumor\tPU:ILLUMINA" | $samtools view -bS - -o tumor.raw.bam
$samtools sort -@ 12 -m 20G tumor.raw.bam -o tumor.sort.bam
perl $factera -F -o ./ tumor.sort.bam $exons_bed $ref_bit $target
如果有normal样本的话需要再来一次这样的分析。
第10步,CNV
cnvkit=cnvkit软件路径
access_hg19=access.hg19.bed文件
###有配对样本
python3 $cnvkit batch \
tumor.markdup.bam \
--normal normal.markdup.bam \
--targets $target \
--annotate $refFlat \
--fasta 参考基因组数据库 \
--access $access_hg19 \
--output-reference tumor.reference.cnn \
--output-dir ./ \
--diagram --scatter
####只有tumor样本
reference_cnn=CNV基线文件
python3 $cnvkit batch \
tumor.markdup.bam \
--reference $reference_cnn \
--output-dir ./
第11步,bamdst,测序位点的突变内容
bamdst=bamdst软件路径
$bamdst -p $target tumor.markdup.BQSR.bam -o ./
当然我们实际分析过程不可能是这样的一步一步操作,需要一键式出所有结果,这就需要大家有更高的代码能力,下面提供一个示例,用法是
只有tumor样本
bash gatk.panel.sh sample.clean.R1.fastq.gz sample.clean.R2.fastq.gz sample(样本名称) 输出路径 -t bed文件 -c CNV基线文件
有配对样本
bash gatk.panel.sh sample.clean.R1.fastq.gz sample.clean.R2.fastq.gz sample(样本名称) 输出路径 -t bed文件 -n normal_fq1 -N normal_fq2
核心脚本gatk.panel.sh
以下内容为付费内容,定价 ¥2000.00点击修改
#!/bin/bash
####zhaoyunfei
###############################################################
###############################################################
###############################################################
#path of software
fastp=fastp软件路径
bwa=/home/jinsai/software/bwa/bwa-master/bwa软件路径
samtools=/home/jinsai/software/samtools/samtools-1.9/samtools软件路径
gatk1=gatk软件路径
annovar=table_annovar.pl软件路径
cnvkit=cnvkit.py软件路径
factera=factera.pl软件路径
bedtools=bedtools软件路径
sambamba=sambamba软件路径
msi=msisensor2软件路径
msi_model=models_hg19_GRCh37
msisensor_pro=msisensor_pro软件路径
###############################################################
#############################################################################################
#reference file
hg19=ucsc.hg19.fasta路径
Mills_and_1000G_gold_standard=Mills_and_1000G_gold_standard.indels.hg19.sites.clean.vcf路径
DBSNP=dbsnp_138.hg19.clean.vcf路径
INDELS=1000G_phase1.snps.high_confidence.hg19.sites.clean.vcf路径
MILLS=Mills_and_1000G_gold_standard.indels.hg19.sites.clean.vcf路径
cnvkit_reference=reference.cnn路径
af_only_gnomad=somatic-b37_af-only-gnomad.new.sites.vcf路径
Mutect2_Filter=Mutect2_Filter路径
humandb=humandb路径
ref_bit=ucsc.hg19.2bit路径
exons_bed=exons.bed路径
refFlat=refFlat.txt路径
access_hg19=access.hg19.bed路径
small_exac_common=somatic-b37_small_exac_common_3.new.vcf路径
#############################################################################################
absname(){
basepath=$(cd `dirname $1`; pwd)
basefile=$(basename $1)
echo $basepath"/"$basefile
}
function usage(){
echo -e "Usage: `basename $0` "'fq1 fq2 library out_dir'
echo -e " -n <normal fq1>"
echo -e " -N <normal fq2>"
echo -e " -m|--max_threads <Integer: Default value: 24. max threads >"
echo -e " -t|--target <target file, default:Panel.bed>"
echo -e " -c|--reference_cnn < reference_cnn file ,required = T > "
}
nor_fq1=''
nor_fq2=''
threads=24
ARGS=`getopt -o n:N:t:m:c:h --long a:,N:,target,max_threads,reference_cnn,help -n "$0" -- "$@"`
if [ $? != 0 ]; then
usage;
exit 1
fi
eval set -- "${ARGS}"
while true
do
case "$1" in
-h|--help)
usage;
exit;
;;
-n|--n)
case "$2" in
"")
shift 2
;;
*)
n_fq1=$2
shift 2;
;;
esac
;;
-N|--N)
case "$2" in
"")
shift 2
;;
*)
n_fq2=$2
shift 2;
;;
esac
;;
-t|--target)
case "$2" in
"")
shift 2
;;
*)
target=$2
shift 2;
;;
esac
;;
-m|--max_threads)
case "$2" in
"")
shift 2
;;
*)
threads=$2
shift 2;
;;
esac
;;
-c|--reference_cnn)
case "$2" in
"")
shift 2
;;
*)
reference_cnn=$2
shift 2;
;;
esac
;;
--)
shift
break
;;
*)
echo "Internal error!"
exit 1
;;
esac
done
if [[ $# != 4 ]]; then
usage;
exit
fi
if [[ -e $target ]]; then
target=`absname $target`;
else
echo "--target: No such file! $target";
usage;
exit;
fi
echo $target;
fq1=$1
if [[ -e $1 ]]; then
fq1=`absname $1`;
else
echo "No such file: $fq1";
exit;
fi
fq2=$2
if [[ -e $2 ]]; then
fq2=`absname $2`;
else
echo "No such file: $fq2"
exit;
fi
libname=$3
out_dir=$4
if [[ -d $out_dir ]]; then
out_dir=`absname $out_dir `
else
mkdir -p $out_dir
out_dir=`absname $out_dir `
fi
echo $out_dir
prefix_n=""
prefix_t=""
if [[ $n_fq1 != "" ]]; then
prefix_n=$libname"_N"
prefix_t=$libname"_T"
n_fq1=`absname $n_fq1`;
n_fq2=`absname $n_fq2`;
else
prefix_t=$libname
fi
echo "******* step 1: quality control *******"
log_dir=$out_dir/$libname".log"
mkdir -p $log_dir
step1_dir=$out_dir/$libname".QC"
mkdir -p $step1_dir
echo "****** clean tumor data ***********"
echo $fastp -i $fq1 -I $fq2 -o $step1_dir"/"$prefix_t"_clean.R1.fq.gz" -O $step1_dir"/"$prefix_t"_clean.R2.fq.gz" -h $step1_dir"/"$prefix_t"_fastp.html" -j $step1_dir"/"$prefix_t"_fastp.json" -Q -L -A -3 -5 | sh - > $log_dir"/"run_$prefix_t.fastp.sh.log 2>&1
if [[ $n_fq1 != "" ]]; then
echo " ==> clean normal data"
echo $fastp -i $n_fq1 -I $n_fq2 -o $step1_dir"/"$prefix_n"_clean.R1.fq.gz" -O $step1_dir"/"$prefix_n"_clean.R2.fq.gz" -h $step1_dir"/"$prefix_n"_fastp.html" -j $step1_dir"/"$prefix_n"_fastp.json" -Q -L -A -3 -5 | sh - > $log_dir"/"run_$prefix_n.fastp.sh.log 2>&1
fi
echo "******* step 2: bwa *******"
step2_dir=$out_dir/$libname".mapping"
mkdir -p $step2_dir
echo "****** 2.1 mapping sequence: tumor[$prefix_t] bwa ******"
bash_str1="$bwa mem -t 24 -R \"@RG\\tID:$prefix_t\\tLB:$prefix_t\\tPL:ILLUMINA\\tSM:$prefix_t\\tPU:ILLUMINA\" $db $step1_dir"/"$prefix_t"_clean.R1.fq.gz" $step1_dir"/"$prefix_t"_clean.R2.fq.gz" | $samtools view -bS - -o $step2_dir/$prefix_t.raw.bam "
echo $bash_str1
echo $bash_str1 | sh - > $log_dir/run_$prefix_t.map.log 2>&1
mstr=`expr 12 \* 2`"G"
echo "****** 2.2 mapping sequence: tumor[$prefix_t] *******"
$samtools sort -@ 16 -m 10G $step2_dir/$prefix_t.raw.bam -o $step2_dir/$prefix_t.sort.bam >> $log_dir/run_$prefix_t.map.log 2>&1
echo "******* 2.2 mapping sequence: tumor[$prefix_t] MarkDuplicates ******"
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=$step2_dir" MarkDuplicates \
-I $step2_dir/$prefix_t.sort.bam \
-ASO coordinate \
-O $step2_dir/$prefix_t.markdup.bam \
-M $step2_dir/$prefix_t.markdup_metrics.txt >> $log_dir/run_$prefix_t.map.log 2>&1
$samtools index $step2_dir/$prefix_t.markdup.bam >> $log_dir/run_$prefix_t.map.log 2>&1
echo "****** 2.3 mapping sequence: tumor[$prefix_t] BQSR ******"
tmp_bed=$step2_dir/$prefix_t.tmp.bed
cat $target | awk -v OFS="\t" -v gap=500 '{$2=$2-gap;$3=$3+gap;print $0}' | $bedtools merge -i - > $tmp_bed
$gatk BaseRecalibrator \
-R $db \
-I $step2_dir/$prefix_t.markdup.bam \
-L $tmp_bed \
--known-sites $DBSNP \
--known-sites $MILLS \
--known-sites $INDELS \
-O $step2_dir/$prefix_t.recal_data >> $log_dir/run_$prefix_t.map.log 2>&1
$gatk ApplyBQSR \
--bqsr-recal-file $step2_dir/$prefix_t.recal_data \
-L $tmp_bed \
-R $db \
-I $step2_dir/$prefix_t.markdup.bam \
-O $step2_dir/$prefix_t.markdup.BQSR.bam >> $log_dir/run_$prefix_t.map.log 2>&1
rm $tmp_bed
if [[ $n_fq1 != "" ]]; then
echo "****** mapping sequence: normal[$prefix_n] bwa *******"
bash_str2=`echo "$bwa mem -t 12 $db $step1_dir"/"$prefix_n"_clean.R1.fq.gz" $step1_dir"/"$prefix_n"_clean.R2.fq.gz" -R \"@RG\\tID:$prefix_n\\tLB:$prefix_n\\tPL:ILLUMINA\\tSM:$prefix_n\\tPU:ILLUMINA\" | $samtools view -bS - -o $step2_dir/$prefix_n.raw.bam "`
echo $bash_str2
echo $bash_str2 | sh - > $log_dir/run_$prefix_n.map.log 2>&1
echo " ==> 2.4 mapping sequence: normal[$prefix_n] sort"
mstr=`expr 12 \* 2`"G"
$samtools sort -@ 12 -m 10G $step2_dir/$prefix_n.raw.bam -o $step2_dir/$prefix_n.sort.bam >> $log_dir/run_$prefix_n.map.log 2>&1
echo "****** 2.5 mapping sequence: normal[$prefix_n] MarkDuplicates ****************"
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=$step2_dir" MarkDuplicates \
-I $step2_dir/$prefix_n.sort.bam \
-ASO coordinate \
-O $step2_dir/$prefix_n.markdup.bam \
-M $step2_dir/$prefix_n.markdup_metrics.txt >> $log_dir/run_$prefix_n.map.log 2>&1
$samtools index $step2_dir/$prefix_n.markdup.bam >> $log_dir/run_$prefix_n.map.log 2>&1
echo "*************** 2.6 mapping sequence: normal[$prefix_n] BQSR *************"
tmp_bed=$step2_dir/$prefix_n.tmp.bed
cat $target | awk -v OFS="\t" -v gap=500 '{$2=$2-gap;$3=$3+gap;print $0}' | $bedtools merge -i - > $tmp_bed
$gatk BaseRecalibrator \
-R $db \
-I $step2_dir/$prefix_n.markdup.bam \
-L $tmp_bed \
--known-sites $DBSNP \
--known-sites $MILLS \
--known-sites $INDELS \
-O $step2_dir/$prefix_n.recal_data >> $log_dir/run_$prefix_n.map.log 2>&1
$gatk ApplyBQSR \
--bqsr-recal-file $step2_dir/$prefix_n.recal_data \
-L $tmp_bed \
-R $db \
-I $step2_dir/$prefix_n.markdup.bam \
-O $step2_dir/$prefix_n.markdup.BQSR.bam >> $log_dir/run_$prefix_n.map.log 2>&1
rm $tmp_bed
fi
echo "******* step 3: call ******"
step3_dir=$out_dir/$libname".GATK"
dir3_gatk=$step3_dir/gatk
mkdir -p $dir3_gatk
if [[ $n_fq1 != "" ]]; then
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=$dir3_gatk" HaplotypeCaller \
-R $db \
-I $step2_dir/$prefix_n.markdup.BQSR.bam \
--output $dir3_gatk/$libname.raw.vcf \
-L $target \
-stand-call-conf 30.0 \
--max-mnp-distance 5 \
-RF GoodCigarReadFilter \
--dont-use-soft-clipped-bases \
--minimum-mapping-quality 20 \
--native-pair-hmm-threads 24 > $log_dir/run_$libname.gatk.log 2>&1
else
$gatk --java-options "-Xms5g -Xmx5g -Djava.io.tmpdir=$dir3_gatk" HaplotypeCaller \
-R $db \
-I $step2_dir/$prefix_t.markdup.BQSR.bam \
--output $dir3_gatk/$libname.raw.vcf \
-L $target \
-stand-call-conf 30.0 \
--max-mnp-distance 5 \
-RF GoodCigarReadFilter \
--dont-use-soft-clipped-bases \
--minimum-mapping-quality 20 \
--native-pair-hmm-threads 24 > $log_dir/run_$libname.gatk.log 2>&1
fi
$gatk SelectVariants \
--select-type-to-exclude SNP \
-V $dir3_gatk/$libname.raw.vcf \
-O $dir3_gatk/$libname.snp.vcf >> $log_dir/run_$libname.gatk.log 2>&1
$gatk SelectVariants \
--select-type-to-include INDEL \
-V $dir3_gatk/$libname.raw.vcf \
-O $dir3_gatk/$libname.indel.vcf >> $log_dir/run_$libname.gatk.log 2>&1
$gatk VariantFiltration \
-R $db \
--filter-expression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0" \
--filter-name "Standard" \
--filter-expression "DP <= 5" \
--filter-name "DP_5" \
--filter-expression "DP < 20" \
--filter-name "DP_20" \
-V $dir3_gatk/$libname.indel.vcf \
-O $dir3_gatk/$libname.indel.filter.vcf >> $log_dir/run_$libname.gatk.log 2>&1
$gatk VariantFiltration \
-R $db \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0" \
--filter-name "Standard" \
--filter-expression "DP <= 5" \
--filter-name "DP_5" \
--filter-expression "DP < 20" \
--filter-name "DP_20" \
-V $dir3_gatk/$libname.snp.vcf \
-O $dir3_gatk/$libname.snp.filter.vcf >> $log_dir/run_$libname.gatk.log 2>&1
echo "**************** step 3.2.call: mutect2 **************"
dir3_mutect=$step3_dir/mutect
mkdir -p $dir3_mutect
if [[ $n_fq1 == "" ]]; then # single sample
echo "************ step 3.2.call: mutect <only tumor> ************"
echo "$gatk --java-options \"-Xmx5g -Xms5g -Djava.io.tmpdir=$dir3_mutect\" Mutect2 \\
-R $db \\
-I $step2_dir/$prefix_t.markdup.BQSR.bam \\
-tumor $prefix_t \\
-L $target \\
--output $dir3_mutect/$prefix_t.raw.vcf \\
--germline-resource $af_only_gnomad \\
--af-of-alleles-not-in-resource 0.00003125 \\
--f1r2-tar-gz $dir3_mutect/$prefix_t.f1r2.tar.gz \\
--min-base-quality-score 20 \\
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \\
--native-pair-hmm-threads 24 \\
-G StandardMutectAnnotation \\
--dont-use-soft-clipped-bases"
$gatk --java-options "-Xmx5g -Xms5g -Djava.io.tmpdir=$dir3_mutect" Mutect2 \
-R $db \
-I $step2_dir/$prefix_t.markdup.BQSR.bam \
-tumor $prefix_t \
-L $target \
--output $dir3_mutect/$libname.raw.vcf \
--germline-resource $af_only_gnomad \
--af-of-alleles-not-in-resource 0.00003125 \
--f1r2-tar-gz $dir3_mutect/$libname.f1r2.tar.gz \
--min-base-quality-score 20 \
--max-mnp-distance 5 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--native-pair-hmm-threads 24 \
-G StandardMutectAnnotation \
--dont-use-soft-clipped-bases >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk LearnReadOrientationModel \
-I $dir3_mutect/$libname.f1r2.tar.gz \
-O $dir3_mutect/$libname.read-orientation-model.tar.gz >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk GetPileupSummaries \
-I $step2_dir/$prefix_t.markdup.BQSR.bam \
-L $small_exac_common \
-V $small_exac_common \
-O $dir3_mutect/$prefix_t.tumor_getpileupsummaries.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk CalculateContamination \
-I $dir3_mutect/$prefix_t.tumor_getpileupsummaries.table \
-O $dir3_mutect/$prefix_t.tumor_calculatecontamination.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk FilterMutectCalls \
-V $dir3_mutect/$prefix_t.raw.vcf \
-R $db \
--ob-priors $dir3_mutect/$libname.read-orientation-model.tar.gz \
-contamination-table $dir3_mutect/$libname.tumor_calculatecontamination.table \
-O $dir3_mutect/$prefix_t.raw.filtered.vcf >> $log_dir/run_$libname.mutect2.log 2>&1
ln -s $dir3_mutect/$libname.raw.filtered.vcf $dir3_mutect/$libname.filtered.vcf
else
echo "$gatk --java-options \"-Xmx5g -Xms5g -Djava.io.tmpdir=$dir3_mutect\" Mutect2 \\
-R $db \\
-I $step2_dir/$prefix_t.markdup.BQSR.bam \\
-tumor $prefix_t \\
-I $step2_dir/$prefix_n.markdup.BQSR.bam \\
-normal $prefix_n \\
-L $target \\
--output $dir3_mutect/$libname.raw.vcf \\
--germline-resource $af_only_gnomad \\
--af-of-alleles-not-in-resource 0.00003125 \\
--f1r2-tar-gz $dir3_mutect/$libname.f1r2.tar.gz \\
--min-base-quality-score 20 \\
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \\
--native-pair-hmm-threads 8 \\
-G StandardMutectAnnotation \\
--dont-use-soft-clipped-bases \\
-RF NotSupplementaryAlignmentReadFilter"
$gatk --java-options "-Xmx5g -Xms5g -Djava.io.tmpdir=$dir3_mutect" Mutect2 \
-R $db \
-I $step2_dir/$prefix_t.markdup.BQSR.bam \
-tumor $prefix_t \
-I $step2_dir/$prefix_n.markdup.BQSR.bam \
-normal $prefix_n \
-L $target \
--output $dir3_mutect/$libname.raw.vcf \
--germline-resource $af_only_gnomad \
--af-of-alleles-not-in-resource 0.00003125 \
--f1r2-tar-gz $dir3_mutect/$libname.f1r2.tar.gz \
--min-base-quality-score 20 \
--max-mnp-distance 5 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--native-pair-hmm-threads 8 \
-G StandardMutectAnnotation \
--dont-use-soft-clipped-bases \
-RF NotSupplementaryAlignmentReadFilter >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk LearnReadOrientationModel \
-I $dir3_mutect/$libname.f1r2.tar.gz \
-O $dir3_mutect/$libname.read-orientation-model.tar.gz >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk GetPileupSummaries \
-I $step2_dir/$prefix_n.markdup.BQSR.bam \
-L $small_exac_common \
-V $small_exac_common \
-O $dir3_mutect/$prefix_n.tumor_getpileupsummaries.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk CalculateContamination \
-I $dir3_mutect/$prefix_n.tumor_getpileupsummaries.table \
-O $dir3_mutect/$prefix_n.tumor_calculatecontamination.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk GetPileupSummaries \
-I $step2_dir/$prefix_t.markdup.BQSR.bam \
-L $small_exac_common \
-V $small_exac_common \
-O $dir3_mutect/$libname.tumor_getpileupsummaries.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk CalculateContamination \
-I $dir3_mutect/$libname.tumor_getpileupsummaries.table \
-matched $dir3_mutect/$prefix_n.tumor_getpileupsummaries.table \
-O $dir3_mutect/$libname.tumor_calculatecontamination.table >> $log_dir/run_$libname.mutect2.log 2>&1
$gatk FilterMutectCalls \
-V $dir3_mutect/$libname.raw.vcf \
-R $db \
--ob-priors $dir3_mutect/$libname.read-orientation-model.tar.gz \
-O $dir3_mutect/$libname.raw.filtered.vcf >> $log_dir/run_$libname.mutect2.log 2>&1
ln -s $dir3_mutect/$libname.raw.filtered.vcf $dir3_mutect/$libname.filtered.vcf
fi
echo "$gatk CNNScoreVariants \\
-V $dir3_gatk/$libname.raw.vcf \\
-R $db \\
-O $dir3_gatk/$libname.anno.vcf"
echo "$gatk CNNScoreVariants \\
-V $dir3_gatk/$libname.raw.vcf \\
-R $db \\
-O $dir3_gatk/$libname.anno.vcf"
echo "******* step 4: filter *******"
step4_dir=$out_dir/$libname".ANNOVAR"
mkdir -p $step4_dir
echo "******* step 4.1.filter: gatk **************"
dir4_gatk=$step4_dir/gatk
mkdir -p $dir4_gatk
ln -s $dir3_gatk/$libname.snp.filter.vcf $dir4_gatk/$libname.snp.filter.vcf > $log_dir/run_$prefix_t.filt_gatk.log 2>&1
ln -s $dir3_gatk/$libname.indel.filter.vcf $dir4_gatk/$libname.indel.filter.vcf >> $log_dir/run_$prefix_t.filt_gatk.log 2>&1
sed -i --follow-symlinks 's/^chrM\t/chrMT\t/g' $dir4_gatk/$libname.snp.filter.vcf >> $log_dir/run_$prefix_t.filt_gatk.log 2>&1
perl $table_annovar \
--buildver hg19 \
--otherinfo \
--nastring . $dir4_gatk/$libname.snp.filter.vcf $humandb \
-protocol refGene,dbnsfp31a_interpro,avsnp150,snp138,1000g2015aug_all,exac03,clinvar_20210501,intervar_20180118,cosmic95,dbnsfp30a \
-operation g,f,f,f,f,f,f,f,f,f \
--vcfinput --remove >> $log_dir/run_$prefix_t.filt_gatk.log 2>&1
sed -i --follow-symlinks 's/^chrM\t/chrMT\t/g' $dir4_gatk/$libname.indel.filter.vcf
perl $table_annovar \
--buildver hg19 \
--otherinfo \
--nastring . $dir4_gatk/$libname.indel.filter.vcf $humandb \
-protocol refGene,dbnsfp31a_interpro,avsnp150,snp138,1000g2015aug_all,exac03,clinvar_20210501,intervar_20180118,cosmic95,dbnsfp30a \
-operation g,f,f,f,f,f,f,f,f,f \
--vcfinput --remove >> $log_dir/run_$prefix_t.filt_gatk.log 2>&1
echo "************** step 4.2.filter: mutect **************"
dir4_mutect=$step4_dir/mutect
mkdir -p $dir4_mutect
ln -s $dir3_mutect/$libname.raw.filtered.vcf $dir4_mutect/$libname.filter.vcf
sed -i --follow-symlinks 's/^chrM\t/chrMT\t/g' $dir4_mutect/$libname.filter.vcf
perl $table_annovar \
--buildver hg19 \
--otherinfo \
--nastring . $dir4_mutect/$libname.filter.vcf $humandb \
-protocol refGene,dbnsfp31a_interpro,avsnp150,snp138,1000g2015aug_all,exac03,clinvar_20210501,intervar_20180118,cosmic95,dbnsfp30a \
-operation g,f,f,f,f,f,f,f,f,f \
--vcfinput --remove > $log_dir/run_$prefix_t.filt_mutect2.log 2>&1
echo "**************** step 5: MSI *************"
step5_dir=$out_dir/$libname".MSI"
mkdir -p $step5_dir
$msisensor2 msi -M $modelhg19 -t $step2_dir/$libname.markdup.BQSR.bam -o $step5_dir/$libname.MSI > $log_dir/run_$prefix_n.msi.log 2>&1
echo "**************** step 6: fusion*************"
step6_dir=$out_dir/$libname".Fusion"
mkdir -p $step6_dir
bash_Fusion="$bwa mem -Y -t 8 $db $step1_dir"/"$prefix_t"_clean.R1.fq.gz" $step1_dir"/"$prefix_t"_clean.R2.fq.gz" -R \"@RG\\tID:$prefix_t\\tLB:$prefix_t\\tPL:ILLUMINA\\tSM:$prefix_t\\tPU:ILLUMINA\" | $samtools view -bS - -o $step6_dir/$prefix_t.raw.bam "
echo $bash_Fusion
echo $bash_Fusion | sh - > $log_dir/run_$prefix_t.fusion.log 2>&1
mstr=`expr 8 \* 2`"G"
$samtools sort -@ 8 -m 10G $step6_dir/$prefix_t.raw.bam -o $step6_dir/$prefix_t.sort.bam >> $log_dir/run_$prefix_t.fusion.log 2>&1
/home/jinsai/software/conda/miniconda/envs/wes/bin/perl $factera -F -r 2 -m 1 -x 200 -f 0.8 -S 0.8 -o $step6_dir $step6_dir/$prefix_t.sort.bam $exons_bed $ref_bit $target >> $log_dir/run_$prefix_t.fusion.log 2>&1
if [[ $n_fq1 != "" ]]; then
bash_Fusion="$bwa mem -Y -t 24 $db $step1_dir"/"$prefix_n"_clean.R1.fq.gz" $step1_dir"/"$prefix_n"_clean.R2.fq.gz" -R \"@RG\\tID:$prefix_n\\tLB:$prefix_n\\tPL:ILLUMINA\\tSM:$prefix_n\\tPU:ILLUMINA\" | $samtools view -bS - -o $step6_dir/$prefix_n.raw.bam "
echo $bash_Fusion
echo $bash_Fusion | sh - > $log_dir/run_$prefix_n.fusion.log 2>&1
mstr=`expr 8 \* 2`"G"
$samtools sort -@ 24 -m 10G $step6_dir/$prefix_n.raw.bam -o $step6_dir/$prefix_n.sort.bam >> $log_dir/run_$prefix_n.fusion.log 2>&1
/home/jinsai/software/conda/miniconda/envs/wes/bin/perl $factera -F -r 2 -m 1 -x 200 -f 0.8 -S 0.8 -o $step6_dir $step6_dir/$prefix_n.sort.bam $exons_bed $ref_bit $target>> $log_dir/run_$prefix_n.fusion.log 2>&1
fi
echo "**************** step 7: CNV*************"
step7_dir=$out_dir/$libname".CNV"
mkdir -p $step7_dir
if [[ $n_fq1 != "" ]]; then
/home/jinsai/software/conda/miniconda/envs/wes/bin/python $cnvkit batch \
$step2_dir/$prefix_t.markdup.bam \
--normal $step2_dir/$prefix_n.markdup.bam \
--targets $target \
--annotate $refFlat \
--fasta $db \
--access $access_hg19 \
--output-reference $step7_dir/$libname.reference.cnn \
--output-dir $step7_dir/ \
--diagram --scatter > $log_dir/run_$libname.cnv.log 2>&1
else
/home/jinsai/software/conda/miniconda/envs/wes/bin/python $cnvkit batch \
$step2_dir/$prefix_t.markdup.bam \
--reference $reference_cnn \
--output-dir $step7_dir/ > $log_dir/run_$prefix_t.cnv.log 2>&1
fi
echo "**************** step 8: bamdst*************"
step8_dir=$out_dir/$libname".bamdst"
mkdir -p $step8_dir
$bamdst -p $target $step2_dir/$prefix_t.markdup.BQSR.bam -o $step8_dir > $log_dir/run_$prefix_t.bamdst.log 2>&1
这篇博客主要分享了全外显子测序(WES)数据分析的流程,包括数据质控、比对、BAM处理、变异调用、注释、MSI、融合检测、CNV和突变分析等步骤。强调了自动化脚本的重要性,特别是对于单肿瘤样本和配对样本的处理。还提到了参考基因组构建、OncoKB注释、临床用药指导等方面的内容。
panel分析基础篇&spm=1001.2101.3001.5002&articleId=139188593&d=1&t=3&u=3e35817e85414b458352c2ad0ed186f8)
1981

被折叠的 条评论
为什么被折叠?



