Q-scores
Phred Quality Score, often shortened to Q-score, is a metric describing the probability of an incorrect base call in a sequence. The higher the Q-score, the better.
Q-score Q
is defined below where P
is the basecalling error probability:
Conversely:
For example, if the basecalling error probability P
is 1/1000 := 10^(-3)
then we get Q30.
Alternatively, if the basecalling accuracy A
is 99% we get Q20 as A = 1 - P
:
Q = -10 * log10 ( 1 - (99/100) )
= -10 * log10 ( 1 - 0.99 )
= -10 * log10 ( 0.01 )
= -10 * log10 ( 10^(-2) )
= -10 * (-2)
= 20
Q-string
Per-base Q-scores for a given sequence are encoded as ASCII characters
which forms a string. This is known as the Q-string. The encoding spans !
to ~
representing
0 to 93 (inclusive) respectively.
In practice, Dorado reports a maximum Q-score of 50 which is S
(upper case).
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
| ------------- Dorado Range [0, 50] ------------ |
To calculate the the Q-score from a Q-string character subtract 33
from the ASCII decimal value (ASCII(33) = !
).
Tip
In Python - use the ord
built-in to convert a ASCII character to its decimal value.
Mean Q-score calculation in Dorado
Given a Q-string which is generated by Dorado during basecalling and represents Dorado's confidence in the basecall, the mean Q-score calculation is performed as follows:
1. Trim the leading 60 bases if the sequence is longer than 60 bases.
2. Convert the Q-scores into error probabilities
3. Calculate the mean of the error probabilities
4. Convert the mean error probability into mean Q-score
The leading 60 bases are trimmed to account for the higher than normal noise at the beginning of sequencing.
Below is an example Python snippet calculating the mean Q-score from a Q-string.
import math
def mean_qscore(qstring: str) -> float:
"""Calculates mean Q-score from a Q-string"""
# Truncate string is sufficiently long
qstr = quality_string[60:] if len(quality_string) > 60 else quality_string
# Convert ASCII to Phred quality scores, then to estimated error
errors = [10**(-(ord(char) - 33)/10) for char in qstr]
# Calculate mean error
mean_error = sum(errors) / len(errors)
# Convert back to q-score
mean_qscore = -10 * math.log10(mean_error)
return mean_qscore
# Example quality string
quality_string = ";;;;;?<<<?GJI>>>>>>=<=<<=>>A@A??@==<8889?>=<<<;<<;;?==<;::;;<SS:::7666444<2+')933;<<=???CA?>=>=>>??51)))+&&(('&&&&''))HELLO;READER;'()))))&%%$"
print(mean_qscore(quality_string))
# 10.380942270316051