mdciao.nomenclature.AlignerConsensus
- class mdciao.nomenclature.AlignerConsensus(maps, tops=None)
Use consensus labels for multiple sequence alignment.
Instead of doing an actual multiple sequence alignment, we can exploit the existing consensus labels to align residues across very different (=low sequence identity) sequences and, optionally, topologies.
Without topologies, the alignment via the consensus labels is limited to the reference UniProt residue sequence indices:
>>> import mdciao
>>> # Get the consensus labels from the GPCRdb and store them in a dict
>>> maps = {"OPS": mdciao.nomenclature.LabelerGPCR("opsd_bovin"),
...         "B2AR": mdciao.nomenclature.LabelerGPCR("adrb2_human"),
...         "MUOR": mdciao.nomenclature.LabelerGPCR("oprm_mouse")}
>>> # Pass the maps to AlignerConsensus
>>> AC = mdciao.nomenclature.AlignerConsensus(maps)
>>> AC.AAresSeq
    consensus   OPS  B2AR  MUOR
0     1.25x25   NaN   Q26   NaN
1     1.26x26   NaN   E27   NaN
2     1.27x27   NaN   R28   NaN
3     1.28x28   E33   D29   NaN
4     1.29x29   P34   E30   M65
..        ...   ...   ...   ...
285   8.55x55  V318  Q337  R348
286   8.56x56  T319  E338  E349
287   8.57x57  T320  L339  F350
288   8.58x58  L321  L340  C351
289   8.59x59  C322  C341  I352
You can also filter the alignment using the _match methods:
>>> AC.AAresSeq_match("3.5*")
    consensus   OPS  B2AR  MUOR
117   3.50x50  R135  R131  R165
118   3.51x51  Y136  Y132  Y166
119   3.52x52  V137  F133  I167
120   3.53x53  V138  A134  A168
..        ...   ...   ...   ...
With topologies, e.g. coming from specific pdbs (or from your own MD-setups), the alignment can be expressed in terms of residue indices or Cɑ-atom indices of those specific topologies. In this example, we are loading directly from the PDB, but you could load your own files with your own setups:
>>> pdb3CAP = mdciao.cli.pdb("3CAP")
>>> pdb3SN6 = mdciao.cli.pdb("3SN6")
>>> pdbMUOR = mdciao.cli.pdb("6DDF")
>>> AC = mdciao.nomenclature.AlignerConsensus(maps,
...                                           tops={"OPS": pdb3CAP.top,
...                                                 "B2AR": pdb3SN6.top,
...                                                 "MUOR": pdbMUOR.top})
Zero-indexed, per-topology Cɑ atom indices:
>>> AC.CAidxs_match("3.5*")
    consensus   OPS  B2AR  MUOR
114   3.50x50  1065  7835  5370
115   3.51x51  1076  7846  5381
116   3.52x52  1088  7858  5393
117   3.53x53  1095  7869  5401
[...]
Zero-indexed, per-topology residue serial indices:
>>> AC.residxs_match("3.5*")
    consensus  OPS  B2AR  MUOR
114   3.50x50  134  1007   706
115   3.51x51  135  1008   707
116   3.52x52  136  1009   708
117   3.53x53  137  1010   709
..        ...  ...   ...   ...
By default, the _match methods return only rows where all present systems (“OPS”, “B2AR”, “MUOR” in the example) have consensus labels. E.g. here we ask for TM5 residues present in all three systems:
>>> AC.AAresSeq_match("5.*")
    consensus   OPS  B2AR  MUOR
167   5.35x36  N200  N196  E229
168   5.36x37  E201  Q197  N230
169   5.37x38  S202  A198  L231
..        ...   ...   ...   ...
198   5.66x66  K231  K227  K260
199   5.67x67  E232  R228  S261
200   5.68x68  A233  Q229  V262
But you can relax the match and get an overview of missing residues using omit_missing=False:
>>> AC.AAresSeq_match("5.*", omit_missing=False)
    consensus   OPS  B2AR  MUOR
162   5.30x31   NaN   NaN  P224
163   5.31x32   NaN   NaN  T225
164   5.32x33   NaN   NaN  W226
165   5.33x34   NaN   NaN  Y227
166   5.34x35  N199   NaN  W228
167   5.35x36  N200  N196  E229
..        ...   ...   ...   ...
200   5.68x68  A233  Q229  V262
201   5.69x69  A234  L230   NaN
202   5.70x70  A235  Q231   NaN
203   5.71x71  Q236  K232   NaN
204   5.72x72  Q237  I233   NaN
205   5.73x73   NaN  D234   NaN
206   5.74x74   NaN  K235   NaN
207   5.75x75   NaN  S236   NaN
208   5.76x76   NaN  E237   NaN
Here, we see e.g. that “MUOR” has more residues present at the beginning of TM5 (first row, from P224@5.30x31 on) and also that e.g. “B2AR” has the longest TM5 (last row, until E237@5.76x76).
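The effect of omit_missing can be pictured with a small stand-alone sketch. It does not call mdciao: the rows are copied from the output above, None stands in for NaN, and filter_rows is a hypothetical helper, not part of the library. How the library implements the filter internally is not shown on this page.

```python
# Hypothetical stand-in for the alignment table: one dict per consensus label,
# with None marking a system that lacks that label (NaN in the real DataFrame).
rows = [
    {"consensus": "5.34x35", "OPS": "N199", "B2AR": None, "MUOR": "W228"},
    {"consensus": "5.35x36", "OPS": "N200", "B2AR": "N196", "MUOR": "E229"},
    {"consensus": "5.68x68", "OPS": "A233", "B2AR": "Q229", "MUOR": "V262"},
    {"consensus": "5.69x69", "OPS": "A234", "B2AR": "L230", "MUOR": None},
]


def filter_rows(rows, keys=("OPS", "B2AR", "MUOR"), omit_missing=True):
    """Keep every row, or only the rows where all systems in `keys` are present."""
    if not omit_missing:
        return list(rows)
    return [row for row in rows if all(row[key] is not None for key in keys)]


strict = [row["consensus"] for row in filter_rows(rows)]
relaxed = [row["consensus"] for row in filter_rows(rows, omit_missing=False)]
ops_muor = [row["consensus"] for row in filter_rows(rows, keys=("OPS", "MUOR"))]

print(strict)    # ['5.35x36', '5.68x68']
print(relaxed)   # ['5.34x35', '5.35x36', '5.68x68', '5.69x69']
print(ops_muor)  # ['5.34x35', '5.35x36', '5.68x68']
```

The last call mimics restricting the check to a subset of systems: 5.34x35 survives because only “OPS” and “MUOR” need to be present. On a pandas DataFrame, the same idea is usually written with dropna and its subset argument.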
Finally, instead of selecting for labels, you can also select for systems, i.e. “Show me the systems that have my selection labels”. Here, we ask what systems have ‘5.70…5.79’ residues:
>>> AC.AAresSeq_match("5.7*", select_keys=True)
    consensus   OPS  B2AR
202   5.70x70  A235  Q231
203   5.71x71  Q236  K232
204   5.72x72  Q237  I233
Note that the “MUOR” column is missing, because that system doesn’t have any ‘5.7*’ residues.
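A minimal sketch of the column selection, again without calling mdciao. The rows are taken from the omit_missing=False output above, None stands in for NaN, and select_systems is a hypothetical helper whose logic is an assumption about the behaviour described here, not the library's code.

```python
# Hypothetical stand-in for the rows matching "5.7*"; None marks a missing residue.
matched = {
    "5.70x70": {"OPS": "A235", "B2AR": "Q231", "MUOR": None},
    "5.71x71": {"OPS": "Q236", "B2AR": "K232", "MUOR": None},
    "5.72x72": {"OPS": "Q237", "B2AR": "I233", "MUOR": None},
    "5.73x73": {"OPS": None, "B2AR": "D234", "MUOR": None},
}


def select_systems(matched):
    """Return the systems with at least one residue among the matched labels."""
    systems = []
    for row in matched.values():
        for key, residue in row.items():
            if residue is not None and key not in systems:
                systems.append(key)
    return systems


kept = select_systems(matched)
# After dropping the empty systems, keep only rows complete in the remaining ones.
complete = [label for label, row in matched.items()
            if all(row[key] is not None for key in kept)]

print(kept)      # ['OPS', 'B2AR']
print(complete)  # ['5.70x70', '5.71x71', '5.72x72']
```

Under these assumptions, “MUOR” is dropped first, and the remaining rows are then filtered for completeness, which reproduces the three rows shown in the example.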
- __init__(maps, tops=None)
- Parameters:
maps (dict) – Dictionary of “maps”, each one mapping residues to consensus labels. The keys of the dict can be arbitrary identifiers that distinguish the different systems, like UniProt codes, PDB IDs, or user-specific labels for system setups (WT vs MUT etc). The values in maps can be:
- Type LabelerGPCR, LabelerCGN, or LabelerKLIFS
Recommended option, the most succinct and versatile. Pass this object and the maps will get created internally on-the-fly, either by calling AA2conlab (if no tops were passed) or by calling top2labels (if tops were passed).
- Type dict
Only works if tops is None. Keyed with residue names (AAresSeq) and valued with consensus labels. Useful if for some reason you want to modify the dicts that are created by AA2conlab before using them here.
- Type list
Only works if tops is not None. Zero-indexed with residue indices of the tops and valued with consensus labels. Useful if for some reason top2labels doesn’t work automatically on the tops, since top2labels sometimes fails if it can’t cleanly align the consensus sequences to the residues in the tops.
tops (dict or None, default is None) – A dictionary of Topology objects, which allows expressing the consensus alignment also in terms of residue indices and of atom indices, using residxs and CAidxs, respectively (otherwise these return None). If tops is present, self.keys will be in the same order as they appear in tops.
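To make the dict and list forms of a map concrete, here is a plain-Python sketch of their shapes and of the idea behind aligning by consensus label. It does not call mdciao; the residues are copied from the sequence_match example on this page, align_by_label is a hypothetical helper, and the four-residue list map is made up for illustration.

```python
# Dict maps: AAresSeq -> consensus label (the form usable when tops is None).
# Real maps would cover the whole receptor, not just a few residues.
maps = {
    "OPS": {"R135": "3.50x50", "Y136": "3.51x51", "V137": "3.52x52"},
    "B2AR": {"R131": "3.50x50", "Y132": "3.51x51", "F133": "3.52x52",
             "A134": "3.53x53"},
}


def align_by_label(maps):
    """Invert each map and collect, per consensus label, every system's residue."""
    table = {}
    for key, res2label in maps.items():
        for residue, label in res2label.items():
            # Systems without this label keep None (NaN in the real DataFrame).
            table.setdefault(label, dict.fromkeys(maps))[key] = residue
    return dict(sorted(table.items()))


table = align_by_label(maps)
for label, row in table.items():
    print(label, row)

# List map: position = zero-indexed residue index in the topology, value = label.
# Hypothetical 4-residue topology where only the last two residues are labeled.
list_map = [None, None, "3.50x50", "3.51x51"]
label2residx = {label: idx for idx, label in enumerate(list_map) if label is not None}
print(label2residx)  # {'3.50x50': 2, '3.51x51': 3}
```

No sequence alignment is needed at any point: residues from different systems end up in the same row simply because they share a consensus label.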
Methods
AAresSeq_match([patterns, keys, ...]) – Filter the self.AAresSeq by rows and columns.
CAidxs_match([patterns, keys, omit_missing, ...]) – Filter the self.CAidxs by rows and columns.
__init__(maps[, tops])
residxs_match([patterns, keys, ...]) – Filter the self.residxs by rows and columns.
sequence_match([patterns, absolute]) – Matrix with the percentage of sequence identity within the set of the residues sharing consensus labels
Attributes
AAresSeq – Consensus-label alignment expressed as residues
CAidxs – Consensus-label alignment expressed as atom indices of the Cɑ atoms of respective tops
keys – The keys with which the maps and the tops were given at input
maps – The dictionaries mapping residue names to consensus labels.
residxs – Consensus-label alignment expressed as zero-indexed residue indices of the respective tops
tops – The topologies given at input
- property AAresSeq: DataFrame
Consensus-label alignment expressed as residues
Will have NaNs where residues weren’t found, i.e. a given map didn’t contain that consensus label
- Returns:
df
- Return type:
DataFrame
- AAresSeq_match(patterns=None, keys=None, omit_missing=True, select_keys=False) → DataFrame
Filter the self.AAresSeq by rows and columns.
You can filter by consensus label using patterns and by system using keys.
By default, rows containing None or NaN are excluded.
- Parameters:
patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.
keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.
select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.
omit_missing (bool, default is True) – Omit rows with missing values.
- Returns:
df
- Return type:
DataFrame
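The pattern syntax can be tried out with Python's standard fnmatch module. The sketch below is an assumption about the semantics described above (comma-separated patterns, a leading "-" for exclusion), not mdciao's own code, and match_labels is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def match_labels(labels, patterns):
    """Return labels matching any include pattern and no exclude pattern."""
    tokens = [p.strip() for p in patterns.split(",") if p.strip()]
    include = [p for p in tokens if not p.startswith("-")]
    exclude = [p[1:] for p in tokens if p.startswith("-")]
    return [label for label in labels
            if any(fnmatchcase(label, p) for p in include)
            and not any(fnmatchcase(label, p) for p in exclude)]


labels = ["3.49x49", "3.50x50", "3.51x51", "3.60x60", "5.35x36"]

print(match_labels(labels, "3.5*"))       # ['3.50x50', '3.51x51']
print(match_labels(labels, "3.*,-3.5*"))  # ['3.49x49', '3.60x60']
print(match_labels(labels, "5.*"))        # ['5.35x36']
```

In filename-style patterns the dot is a literal character and only the asterisk is a wildcard, which is why "3.5*" selects the labels 3.50 to 3.59 and nothing else.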
- property CAidxs: DataFrame
Consensus-label alignment expressed as atom indices of the Cɑ atoms of respective tops
Will have NaNs where atoms weren’t found, i.e. a given map didn’t contain that consensus label
Returns None if no tops were given at input.
- Returns:
df
- Return type:
DataFrame
- CAidxs_match(patterns=None, keys=None, omit_missing=True, select_keys=False) → DataFrame
Filter the self.CAidxs by rows and columns.
You can filter by consensus label using patterns and by system using keys.
By default, rows containing None or NaN are excluded.
- Parameters:
patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g.
“3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59
“G.S*” will include all beta-sheets
keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.
select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.
omit_missing (bool, default is True) – Omit rows with missing values.
- Returns:
df
- Return type:
DataFrame
- property keys: list
The keys with which the maps and the tops were given at input
- property maps: dict
The dictionaries mapping residue names to consensus labels.
- property residxs: DataFrame
Consensus-label alignment expressed as zero-indexed residue indices of the respective tops
Will have NaNs where residues weren’t found, i.e. a given map didn’t contain that consensus label
Returns None if no tops were given at input.
- Returns:
df
- Return type:
DataFrame
- residxs_match(patterns=None, keys=None, omit_missing=True, select_keys=False) → DataFrame
Filter the self.residxs by rows and columns.
You can filter by consensus label using patterns and by system using keys.
By default, rows containing None or NaN are excluded.
- Parameters:
patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.
keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.
select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.
omit_missing (bool, default is True) – Omit rows with missing values.
- Returns:
df
- Return type:
DataFrame
- sequence_match(patterns=None, absolute=False) → DataFrame
Matrix with the percentage of sequence identity within the set of the residues sharing consensus labels
The comparison is done between the reference consensus sequences in self.AAresSeq, i.e., independently of any tops that the user has provided.
Example:
>>> AC.sequence_match(patterns="3.5*")
       OPS  B2AR  MUOR
OPS   100%   29%   57%
B2AR   29%  100%   43%
MUOR   57%   43%  100%
Meaning, for the residues having consensus labels 3.50 to 3.59, B2AR and OPS have 29% identity, and OPS and MUOR have 57%. You can express this as the absolute number of residues:
>>> AC.sequence_match("3.5*", absolute=True)
      OPS  B2AR  MUOR
OPS     7     2     4
B2AR    2     7     3
MUOR    4     3     7
You can check which residues these are:
>>> AC.AAresSeq_match("3.5*")
    consensus   OPS  B2AR  MUOR
117   3.50x50  R135  R131  R165
118   3.51x51  Y136  Y132  Y166
119   3.52x52  V137  F133  I167
120   3.53x53  V138  A134  A168
121   3.54x54  V139  I135  V169
122   3.55x55  C140  T136  C170
123   3.56x56  K141  S137  H171
You can see the two OPS/B2AR matches in 3.50x50 and 3.51x51.
- Parameters:
patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.
absolute (bool, default is False) – Instead of returning a percentage, return the number of matching residues as integers.
- Returns:
match – The matrix of sequence identity, for residues sharing consensus labels across the different systems.
- Return type:
DataFrame
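The numbers in the example can be reproduced by hand with a short sketch. It assumes that identity means "same one-letter amino-acid code at the same consensus label" and that only rows present in all systems are compared. The function name identity_matrix and the rounding are illustrative choices, not mdciao's implementation.

```python
from itertools import product

# AAresSeq columns for the labels 3.50x50 ... 3.56x56, copied from the example.
columns = {
    "OPS": ["R135", "Y136", "V137", "V138", "V139", "C140", "K141"],
    "B2AR": ["R131", "Y132", "F133", "A134", "I135", "T136", "S137"],
    "MUOR": ["R165", "Y166", "I167", "A168", "V169", "C170", "H171"],
}


def identity_matrix(columns, absolute=False):
    """Pairwise identity per row; the one-letter code is the first character."""
    matrix = {}
    for key_a, key_b in product(columns, repeat=2):
        pairs = list(zip(columns[key_a], columns[key_b]))
        matches = sum(res_a[0] == res_b[0] for res_a, res_b in pairs)
        if absolute:
            matrix[key_a, key_b] = matches
        else:
            matrix[key_a, key_b] = round(100 * matches / len(pairs))
    return matrix


counts = identity_matrix(columns, absolute=True)
percent = identity_matrix(columns)

print(counts["OPS", "B2AR"], counts["OPS", "MUOR"], counts["B2AR", "MUOR"])     # 2 4 3
print(percent["OPS", "B2AR"], percent["OPS", "MUOR"], percent["B2AR", "MUOR"])  # 29 57 43
```

Two of seven shared positions give the 29% reported for OPS/B2AR, and the diagonal is trivially 7 residues or 100%.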
- property tops: dict
The topologies given at input