API
- class seq_consensus.AlignmentFrequencies(sequences: Iterable[str] | None = None, alphabet_map: Mapping[str, str] = {'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'R': 'AG', 'S': 'CG', 'T': 'T', 'V': 'ACG', 'W': 'AT', 'Y': 'CT'}, free_endgaps: bool = True, gap_char: str = '-', end_gap_char: str = '-')
This object holds column-wise letter frequencies obtained from a multiple sequence alignment. The object can be constructed from a list or any other iterable of aligned sequences, or sequences can be added incrementally using add.
- Parameters:
sequences (optional) – Iterable of str. All sequences must have equal length; unequal lengths cause a failure.
free_endgaps – If True, end gaps will not be taken into account. Note that if end_gap_char is different from gap_char and free_endgaps is not True, end gaps are converted to internal gaps. Either way, end_gap_char will never occur in the output.
alphabet_map – Alphabet map with keys being letters of the expected alphabet and values being the letters they translate to. This can either be the same letter (unambiguous) or a sorted string of ambiguous letters.
gap_char – Gap character (default: '-')
end_gap_char – End gap character in the sequences (default: '-'). If different from gap_char, it needs to be correctly specified. Otherwise, there will be an error.
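As a rough illustration of what a separate end gap character means, the sketch below rewrites leading and trailing runs of the gap character with a different symbol. The helper name mark_end_gaps and the '~' symbol are hypothetical and not part of seq_consensus; the library's own end gap handling may differ.

```python
def mark_end_gaps(seq: str, gap_char: str = "-", end_gap_char: str = "~") -> str:
    """Replace leading and trailing runs of gap_char with end_gap_char.

    Hypothetical helper for illustration only; not part of seq_consensus.
    Internal gaps are left untouched.
    """
    without_left = seq.lstrip(gap_char)
    n_left = len(seq) - len(without_left)
    core = without_left.rstrip(gap_char)
    n_right = len(without_left) - len(core)
    return end_gap_char * n_left + core + end_gap_char * n_right


print(mark_end_gaps("--AC-GT-"))  # ~~AC-GT~
```

Sequences annotated this way distinguish terminal gaps ('~') from internal ones ('-'), which is the situation the end_gap_char parameter describes.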
- column_freqs() → Iterable[Iterable[Tuple[str, int]]]
Returns an iterator over all columns, whereby each element is again an iterator over letters and frequencies.
Example
>>> from seq_consensus import AlignmentFrequencies
>>> freqs = AlignmentFrequencies(['AG', 'AR'])
>>> for colfreq in freqs.column_freqs():
...     print(dict(colfreq))
{'A': 2}
{'G': 1, 'R': 1}
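The idea of column-wise counting can be sketched in plain Python, independent of the library. The function count_columns is a hypothetical name used only here, and the sketch ignores gap handling entirely.

```python
from collections import Counter


def count_columns(sequences):
    """Count letters per alignment column (illustrative, not the library code)."""
    seqs = list(sequences)
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) yields one tuple of letters per alignment column
    return [dict(Counter(column)) for column in zip(*seqs)]


for col in count_columns(["AG", "AR"]):
    print(col)
# {'A': 2}
# {'G': 1, 'R': 1}
```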
- consensus(threshold: float = 0, strip_gaps: bool = False, gap_char_out: str = '-', end_gap_char_out: str = '-', maybe_gap_char: str = '?')
Calls the consensus sequence given the internal consensus matrix.
- coverage() → array
Returns a numpy.array(dtype=numpy.float64) with the fraction of non-gap sequences at any position. Note that if the object was constructed with free_endgaps, terminal gaps will not count as gaps at all, potentially resulting in high coverage values.
- matrix() → Tuple[str, array]
Returns a tuple of the letters (“row names” in the matrix) and the corresponding letter frequencies at each position. This includes all ambiguous letters as well, even if their frequency is zero.
- normalized_column_freqs() → Iterable[Iterable[Tuple[str, int | float]]]
Works the same as column_freqs, but the frequency of each ambiguous letter is split evenly among the letters it represents.
Example
>>> from seq_consensus import AlignmentFrequencies
>>> freqs = AlignmentFrequencies(['AG', 'AR'])
>>> for colfreq in freqs.normalized_column_freqs():
...     print(dict(colfreq))
{'A': 2.0}
{'A': 0.5, 'G': 1.5}
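The splitting of ambiguity codes can be sketched as follows. This is a minimal illustration using the default alphabet map from the signature above; split_ambiguities is a hypothetical helper, not a library function, and it does not handle gap characters.

```python
# Default DNA alphabet map, copied from the documented signature
IUPAC = {
    'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT',
    'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'R': 'AG', 'S': 'CG', 'T': 'T',
    'V': 'ACG', 'W': 'AT', 'Y': 'CT',
}


def split_ambiguities(column_counts, alphabet_map=IUPAC):
    """Distribute each letter's count evenly over the letters it stands for.

    Illustrative only; letters missing from alphabet_map (such as gaps)
    raise a KeyError in this sketch.
    """
    out = {}
    for letter, count in column_counts.items():
        targets = alphabet_map[letter]
        share = count / len(targets)  # each target letter receives 1/N of the count
        for target in targets:
            out[target] = out.get(target, 0.0) + share
    return dict(sorted(out.items()))


print(split_ambiguities({'A': 2}))          # {'A': 2.0}
print(split_ambiguities({'G': 1, 'R': 1}))  # {'A': 0.5, 'G': 1.5}
print(split_ambiguities({'Y': 1}))          # {'C': 0.5, 'T': 0.5}
```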
- exception seq_consensus.AlphabetLookupError(letters)
- seq_consensus.consensus(sequences: Iterable[str], threshold: float = 0, alphabet_map: Mapping[str, str] = {'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'R': 'AG', 'S': 'CG', 'T': 'T', 'V': 'ACG', 'W': 'AT', 'Y': 'CT'}, free_endgaps: bool = True, strip_gaps: bool = False, gap_char: str = '-', end_gap_char: str = '-', gap_char_out: str = '-', end_gap_char_out: str = '-', maybe_gap_char: str = '?')
Calculates the consensus sequence from a multiple alignment, represented by an iterable of same-length strings. By default, DNA is assumed (sequences may contain IUPAC ambiguities). The consensus letter for each column is decided based on the letter frequencies in that column and on threshold. Ambiguous letters (such as IUPAC degeneracy codes) are split into the letters they represent, and each of the N represented letters receives a share of 1/N of the frequency. For example, the DNA ambiguity “Y” contributes half to “C” and half to “T”.
The behaviour of this function is the same as in the Geneious software and very similar to the DECIPHER R package. Details at https://assets.geneious.com/manual/2022.0/static/GeneiousManualse45.html and http://www2.decipher.codes/index.html
- Parameters:
sequences – Iterable of str (expects equal sequence lengths, will fail otherwise)
threshold – Number between 0 and 1 indicating the proportion of all sequences that need the given letter in a column in order to be accepted as consensus letter. If the frequency is below the threshold, the consensus will be an ambiguity code representing a combination of letters whose cumulative frequency is above the threshold. With a threshold of 0, the most frequent base will always be chosen as consensus, while with a threshold of 1, 100% of the sequences need the same letter in order to obtain an unambiguous consensus.
alphabet_map – Alphabet map with keys being letters of the expected alphabet and values being the letters they translate to. This can either be the same letter (unambiguous) or a sorted string of ambiguous letters.
free_endgaps – If True, end gaps will not be taken into account when forming the consensus. If False, there will be no distinction between internal and end gaps, even if end_gap_char is different from gap_char; in that case, end_gap_char_out will never appear in the output.
strip_gaps – If True, gaps will not be included in the consensus
gap_char – Gap character (default: '-').
end_gap_char – End gap character to be expected in the input (default: '-'). If equal to gap_char (which is the default), end gaps will be automatically recognized when adding a sequence with add. If different from gap_char, end gaps will be assumed to be present in the input and no further end gap parsing is done. It is the user's responsibility to make sure the terminal gaps are correctly annotated.
gap_char_out – Gap character to use in the consensus (default: '-')
end_gap_char_out – End gap character to use in the consensus (default: '-'). This character will only be returned if free_endgaps is True, otherwise terminal gaps are treated like internal gaps.
maybe_gap_char – Character that will be set for columns that have mixed gaps and letters, where gaps are frequent enough that the situation is ambiguous at the given threshold. An unambiguous consensus call is possible only if the most frequent characters are all valid letters or all gaps.
- Returns (str):
The consensus sequence
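The effect of threshold on a single column can be sketched as below. This is a simplified illustration, not the library's implementation: call_column is a hypothetical name, the comparison uses "greater than or equal" so that a threshold of 1 accepts full agreement, ties are broken alphabetically, and gaps and maybe_gap_char are not modelled. All of these are assumptions.

```python
# Default DNA alphabet map, copied from the documented signature
IUPAC = {
    'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT',
    'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'R': 'AG', 'S': 'CG', 'T': 'T',
    'V': 'ACG', 'W': 'AT', 'Y': 'CT',
}
# Reverse lookup: sorted string of letters -> ambiguity code
CODE_FOR = {letters: code for code, letters in IUPAC.items()}


def call_column(freqs, threshold=0.0):
    """Pick a consensus letter for one column of unambiguous letter frequencies.

    Letters are added in order of decreasing frequency until their
    cumulative share of the column reaches the threshold (assumed rule).
    """
    total = sum(freqs.values())
    chosen = []
    cumulative = 0.0
    # Most frequent first; alphabetical order breaks ties (an assumption)
    for letter, freq in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.append(letter)
        cumulative += freq
        if cumulative / total >= threshold:
            break
    return CODE_FOR["".join(sorted(chosen))]


print(call_column({'A': 3, 'G': 1}, threshold=0))    # A
print(call_column({'A': 3, 'G': 1}, threshold=0.8))  # R
print(call_column({'A': 4}, threshold=1))            # A
```

With a threshold of 0 the most frequent letter always wins, while at 0.8 the 75% share of "A" is not enough, so "G" is added and the pair is reported as "R".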