Serious problems were observed.
{% for k, v in errors.items() %}
{{k}} -> {{v}}
{% endfor %}
{% endif %}
{% if warns %}
Please be aware of the following issues. The quality of downstream analysis might be affected.
{% for k, v in warns.items() %}
{{k}} -> {{v}}
{% endfor %}
{% endif %}
Sample name
Name specified by -s option
Yield
The total number of bases
Number of reads
The total number of reads
Q7 bases
The fraction of bases with a QV of 7 or higher
Longest read
Length of the longest read
Estimated non-sense read fraction
The fraction of random or highly diverged sequences. This should be less than 30%. Such reads cannot be mapped to the target sample genome or to any contaminating genome. If there is a non-negligible gap between this value and the mapping rate, please be aware of possible contamination.
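As an illustration of the "Q7 bases" statistic described above, here is a minimal sketch that computes the fraction of bases with QV 7 or higher from FASTQ quality strings. The function name `q7_fraction` and the Phred+33 offset are assumptions made for this example; this is not necessarily how LongQC computes the value.

```python
def q7_fraction(quality_strings, threshold=7, offset=33):
    """Return the fraction of bases whose Phred QV is >= threshold.

    quality_strings: iterable of FASTQ quality lines (assumed Phred+33).
    """
    total = 0
    passed = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            # Decode the ASCII character into a Phred quality value.
            if ord(ch) - offset >= threshold:
                passed += 1
    # Avoid division by zero for empty input.
    return passed / total if total else 0.0
```

With Phred+33 encoding, `(` decodes to QV 7 and `$` to QV 3, so two reads with qualities `((` and `$$` give a fraction of 0.5.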
General statistics
{% for key, value in stats.items() %}
{{ key }}
{{ value }}
{% endfor %}
{% if ad %}
{% if pb %}
{% else %}
{% endif %}
This panel is hidden by default for PacBio data because PacBio sequencers automatically trim adapter sequences. Adapter sequences for PacBio are stored in the raw data files, such as bax.h5 or scraps.bam.
Number of trimmed reads
The number of reads having adapter-like sequences (75% or higher identity) at either end. If this is unexpectedly low and trimming was not performed, it suggests that the adapter ligation step had some problems.
Max seq identity
The maximum identity between the adapter sequence and the read sequences. This value should be quite high (90%) if adapters still exist in the dataset.
Average trimmed length
The average end position of the aligned sequences. This should be consistent with the kit description and with the peak in the flanking region analysis plots.
Adapter statistics
{% for key, value in ad.items() %}
{{ key }}
{{ value }}
{% endfor %}
{% endif %}
{% if rl %}
This panel shows the length distribution of reads. Typical genome sequencing data from third-generation sequencers show a unimodal, exponential-like
distribution; therefore, the alpha parameter of the Gamma distribution is less than 2.
Transcript sequencing data, strictly size-selected data, or highly fragmented data show a higher alpha value, with the distribution shifting to the right.
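To make the alpha parameter concrete, the sketch below estimates the Gamma shape (alpha) and scale from a list of read lengths using the method of moments (alpha = mean² / variance). This is a swapped-in, simple estimator chosen for illustration; the fitting procedure LongQC actually uses is not described here, and the function name `gamma_shape_mom` is hypothetical.

```python
from statistics import fmean, pvariance


def gamma_shape_mom(lengths):
    """Method-of-moments estimate of Gamma (alpha, scale) for read lengths."""
    mean = fmean(lengths)
    var = pvariance(lengths, mu=mean)
    if var == 0:
        raise ValueError("all reads have the same length; alpha is undefined")
    alpha = mean * mean / var   # shape parameter
    scale = var / mean          # scale parameter (theta)
    return alpha, scale
```

A tightly size-selected library has a small variance relative to its mean, which is why it yields a large alpha, whereas an exponential-like length distribution gives alpha close to 1.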
Mean read length
Expected read length from the sample data
N50
The N50 of the sample reads: reads of this length or longer contain at least half of all bases.
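For reference, N50 can be computed as in the following minimal sketch (the function name `n50` is chosen for this example): sort the read lengths in descending order and return the length at which the running total first reaches half of the total bases.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first read where the cumulative sum covers half the bases.
        if running * 2 >= total:
            return length
    return 0
```

For read lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest reads already cover 11 bases, so the N50 is 5.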
Read length
{% for key, value in rl.stats.items() %}
{{ key }}
{{ value }}
{% endfor %}
{% endif %}
{% if rq %}
{% if sequel %}
{% else %}
{% endif %}
This panel shows the distribution of per-read QV, if QVs are given in the file. The threshold is set to 7.
Ideally, both short and long reads should have similar distributions, and the median should be higher than 7.
Note that the x-axis shows binned read lengths, not positions within reads.
Per read QV
{% endif %}
{% if rc %}
The per-read coverage section presents coverage statistics computed on the subsampled reads.
Per read coverage distribution
The first plot is an overview of per-read coverage. If the dataset has no issues, a single peak should be observed, except for metagenomic samples.
LongQC automatically detects this peak using a GMM (for genomes) or a mixture of Gaussian and log-normal distributions (for transcriptomes) to discriminate the true peak from the background level.
The mean/median is then used for a rough genome/transcriptome size estimation.
If there are multiple peaks and the library is not metagenomic, you will observe overdispersion of coverage in further analysis.
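The rough size estimation mentioned above can be pictured with the usual back-of-the-envelope relation: size ≈ total bases / coverage. The sketch below assumes this relation and uses a hypothetical function name; the exact formula used by LongQC is not stated in this report.

```python
def estimate_size(total_bases, coverage):
    """Rough genome/transcriptome size: total yield divided by coverage.

    coverage is assumed to be the mean or median of the detected peak.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_bases / coverage
```

For example, a yield of 460 Mbp with a peak coverage of 100 suggests a size of about 4.6 Mbp.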
Read coverage over different length reads
The middle plot is provided to check whether there is unexpected fluctuation in coverage. In genome sequencing data, fluctuation is expected to stay within a certain range (3 sigma by default). Significant fluctuations would signal some issue.
We confirmed that such fluctuation indicates contamination, a low-quality library, overloading in PacBio, etc.
If the data include genomes of similar size, fluctuation should be small.
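The 3-sigma check described above can be sketched as follows. The example assumes that the mean and standard deviation of the main coverage peak are already known and that coverage has been averaged per read-length bin; the function name, the bin labels, and the input layout are all hypothetical.

```python
def flag_fluctuating_bins(bin_coverage, peak_mean, peak_sigma, n_sigma=3.0):
    """Return labels of length bins whose coverage lies outside mean +/- n_sigma * sigma.

    bin_coverage: dict mapping a length-bin label to its mean coverage.
    """
    low = peak_mean - n_sigma * peak_sigma
    high = peak_mean + n_sigma * peak_sigma
    return [label for label, cov in bin_coverage.items()
            if not (low <= cov <= high)]
```

With a peak mean of 30 and a sigma of 5, the accepted range is 15 to 45, so a bin with mean coverage 55 would be flagged while bins at 30 and 31 would not.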
QV for normal and non-sense reads
The box plot for normal reads should show higher values than that for non-sense reads.
Ideally, the median of non-sense reads (orange line) should be in the red region.
If the two boxes are close to each other, there are two cases.
Case 1: the medians for both normal and non-sense reads sit in the green area. This suggests that coverage may be quite low.
Because the non-sense read group includes many mappable reads in such a case, the average QV of non-sense reads becomes high.
Case 2: the medians for both normal and non-sense reads sit in the red area. This suggests that the dataset is so noisy that further analysis can be badly affected.
All in all, if the two boxes are close, please check the coverage plot carefully.
Lastly, the QV plot cannot be generated for Sequel datasets because Sequel provides no Phred scores at present.
Genomes having relatively long repeats:
The default configuration of LongQC can cope with short or simple repeats; however, some complex genomes, such as plant genomes, sometimes show a long tail in the per-read coverage distribution. The plot of per-read coverage over different read lengths can also fluctuate.
We are working on this known issue and plan to update LongQC as soon as the fix is implemented.
Note:
Note also that the mean/median coverage shown in this section can be smaller than the result obtained using references. Mapping reads onto uncorrected, error-prone sequences is less sensitive, and this lower sensitivity reduces the observed coverage.
The estimated genome/transcriptome size tends to be larger than the actual size because of this effect.
In general, better data give a better size estimate.
Per read coverage
{% for key, value in rc.stats.items() %}
{{ key }}
{{ value }}
{% endfor %}
{% endif %}
{% if gc %}
GC content is shown in this panel. Both distributions are computed from the same data.
The blue one is computed from entire reads, and the red one is computed from chunked subsequences.
The blue one should show a sharper distribution, because longer sequences give a smaller deviation.
However, the read-level GC content distribution can look slightly different in other datasets.
The red one is more robust to sequencing or sample differences, and it should be comparable to other data if the same target (e.g. biological replicates) is sequenced.
Although GC content does not necessarily follow a Gaussian distribution, the mean and standard deviation are shown.
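The difference between the two curves can be illustrated with a minimal sketch that computes GC content once per whole read and once per fixed-size chunk. The function names and the chunk size are assumptions for this example; the chunk length LongQC actually uses is not stated here.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) if seq else 0.0


def chunk_gc(seq, chunk_size):
    """GC fraction of each full, non-overlapping chunk; a trailing partial chunk is dropped."""
    return [gc_fraction(seq[i:i + chunk_size])
            for i in range(0, len(seq) - chunk_size + 1, chunk_size)]
```

For the read `GGCCAATT`, the read-level GC content is 0.5, while chunks of 4 bases give 1.0 and 0.0, which shows why the chunk-level distribution is broader than the read-level one.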
GC contents
{% for key, value in gc.stats.items() %}
{{ key }}
{{ value }}
{% endfor %}
{% endif %}
{% if fr %}
These plots can be used to check for the ligation of specific sequences, such as adapters, and for their removal.
If there are no artificial sequences such as adapters, the peak should appear at 0 in both plots (and a steep slope from 0 can be observed).
Otherwise, a characteristic pattern should be observed, depending on the application.
If adapter-like sequences are observed, their average length is plotted as a dashed red line.