mdciao.nomenclature.AlignerConsensus

class mdciao.nomenclature.AlignerConsensus(maps, tops=None)

Use consensus labels for multiple sequence alignment.

Instead of doing an actual multiple sequence alignment, we can exploit the existing consensus labels to align residues across very different (=low sequence identity) sequences and, optionally, topologies.

Without topologies, the alignment via the consensus labels is limited to the reference UniProt residue sequence indices:

>>> import mdciao
>>> # Get the consensus labels from the GPCRdb and store them in a dict
>>> maps = { "OPS": mdciao.nomenclature.LabelerGPCR("opsd_bovin"),
>>>         "B2AR": mdciao.nomenclature.LabelerGPCR("adrb2_human"),
>>>         "MUOR": mdciao.nomenclature.LabelerGPCR("oprm_mouse")}
>>> # Pass the maps to AlignerConsensus
>>> AC = mdciao.nomenclature.AlignerConsensus(maps)
>>> AC.AAresSeq
    consensus   OPS  B2AR  MUOR
0     1.25x25   NaN   Q26   NaN
1     1.26x26   NaN   E27   NaN
2     1.27x27   NaN   R28   NaN
3     1.28x28   E33   D29   NaN
4     1.29x29   P34   E30   M65
..        ...   ...   ...   ...
285   8.55x55  V318  Q337  R348
286   8.56x56  T319  E338  E349
287   8.57x57  T320  L339  F350
288   8.58x58  L321  L340  C351
289   8.59x59  C322  C341  I352

You can also filter the alignment using the _match methods:

>>> AC.AAresSeq_match("3.5*")
    consensus   OPS  B2AR  MUOR
117   3.50x50  R135  R131  R165
118   3.51x51  Y136  Y132  Y166
119   3.52x52  V137  F133  I167
120   3.53x53  V138  A134  A168
..        ...   ...   ...   ...

With topologies, e.g. coming from specific pdbs (or from your own MD-setups), the alignment can be expressed in terms of residue indices or Cɑ-atom indices of those specific topologies. In this example, we are loading directly from the PDB, but you could load your own files with your own setups:

>>> pdb3CAP = mdciao.cli.pdb("3CAP")
>>> pdb3SN6 = mdciao.cli.pdb("3SN6")
>>> pdbMUOR = mdciao.cli.pdb("6DDF")
>>> AC = mdciao.nomenclature.AlignerConsensus(maps,
>>>                                           tops={ "OPS": pdb3CAP.top,
>>>                                                 "B2AR": pdb3SN6.top,
>>>                                                 "MUOR": pdbMUOR.top})

Zero-indexed, per-topology Cɑ atom indices:

>>> AC.CAidxs_match("3.5*")
    consensus   OPS  B2AR  MUOR
114   3.50x50  1065  7835  5370
115   3.51x51  1076  7846  5381
116   3.52x52  1088  7858  5393
117   3.53x53  1095  7869  5401
[...]

Zero-indexed, per-topology residue serial indices:

>>> AC.residxs_match("3.5*")
    consensus  OPS  B2AR  MUOR
114   3.50x50  134  1007   706
115   3.51x51  135  1008   707
116   3.52x52  136  1009   708
117   3.53x53  137  1010   709
..        ...   ...   ...   ...

By default, the _match methods return only rows where all present systems (“OPS”, “B2AR”, “MUOR” in the example) have consensus labels. E.g. here we ask for TM5 residues present in all three systems:

>>> AC.AAresSeq_match("5.*")
    consensus   OPS  B2AR  MUOR
167   5.35x36  N200  N196  E229
168   5.36x37  E201  Q197  N230
169   5.37x38  S202  A198  L231
..        ...   ...   ...   ...
198   5.66x66  K231  K227  K260
199   5.67x67  E232  R228  S261
200   5.68x68  A233  Q229  V262

But you can relax the match and get an overview of missing residues using omit_missing=False:

>>> AC.AAresSeq_match("5.*", omit_missing=False)
    consensus   OPS  B2AR  MUOR
162   5.30x31   NaN   NaN  P224
163   5.31x32   NaN   NaN  T225
164   5.32x33   NaN   NaN  W226
165   5.33x34   NaN   NaN  Y227
166   5.34x35  N199   NaN  W228
167   5.35x36  N200  N196  E229
..        ...   ...   ...   ...
200   5.68x68  A233  Q229  V262
201   5.69x69  A234  L230   NaN
202   5.70x70  A235  Q231   NaN
203   5.71x71  Q236  K232   NaN
204   5.72x72  Q237  I233   NaN
205   5.73x73   NaN  D234   NaN
206   5.74x74   NaN  K235   NaN
207   5.75x75   NaN  S236   NaN
208   5.76x76   NaN  E237   NaN

Here, we see e.g. that “MUOR” has more residues present at the beginning of TM5 (first row, from P224@5.30x31 on) and also that e.g. “B2AR” has the longest TM5 (last row, until E237@5.76x76).

Finally, instead of selecting for labels, you can also select for systems, i.e. “Show me the systems that have my selection labels”. Here, we ask what systems have ‘5.70…5.79’ residues:

>>> AC.AAresSeq_match("5.7*", select_keys=True)
    consensus   OPS  B2AR
202   5.70x70  A235  Q231
203   5.71x71  Q236  K232
204   5.72x72  Q237  I233

Note that the “MUOR” column is missing, because that system doesn’t have any ‘5.7*’ residues.
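The interplay of omit_missing and select_keys shown above can be mimicked on a plain dict. The following is only a sketch of the documented behaviour, not mdciao's implementation: filter_rows is a hypothetical helper, the rows are a hand-copied subset of the TM5 output above, and None stands in for NaN.

```python
from fnmatch import fnmatchcase

# Subset of the TM5 rows shown above; None stands in for NaN.
KEYS = ["OPS", "B2AR", "MUOR"]
ROWS = {
    "5.34x35": ["N199", None, "W228"],
    "5.35x36": ["N200", "N196", "E229"],
    "5.68x68": ["A233", "Q229", "V262"],
    "5.70x70": ["A235", "Q231", None],
    "5.73x73": [None, "D234", None],
}


def filter_rows(pattern, omit_missing=True, select_keys=False):
    """Hypothetical helper mimicking the documented row/column filtering."""
    # Row selection: keep the labels that match the Unix-style pattern
    hits = {lab: vals for lab, vals in ROWS.items() if fnmatchcase(lab, pattern)}
    cols = list(range(len(KEYS)))
    if select_keys:
        # Column selection: drop systems without any residue in the matched rows
        cols = [i for i in cols if any(vals[i] is not None for vals in hits.values())]
    out = {}
    for lab, vals in hits.items():
        picked = [vals[i] for i in cols]
        # With omit_missing, a row survives only if all remaining systems have it
        if omit_missing and any(v is None for v in picked):
            continue
        out[lab] = picked
    return [KEYS[i] for i in cols], out


print(filter_rows("5.*"))
print(filter_rows("5.7*", select_keys=True))
```

In this sketch, columns are selected before incomplete rows are dropped, which reproduces the behaviour of the last example: “MUOR” disappears first, and only then are rows with gaps in the remaining systems removed.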

__init__(maps, tops=None)
Parameters:
  • maps (dict) – Dictionary of “maps”, each one mapping residues to consensus labels. The keys of the dict can be arbitrary identifiers that distinguish the different systems, like UniProt codes, PDB IDs, or user-specific names for system setups (WT vs MUT etc). The values in maps can be:

    • Type labeler object, e.g. LabelerGPCR as in the examples above.

    Recommended option, the most succinct and versatile. Pass this object and the maps will get created internally on-the-fly, either by calling AA2conlab (if no tops were passed) or by calling top2labels (if tops were passed).

    • Type dict.

    Only works if tops is None. Keyed with residue names (AAresSeq) and valued with consensus labels. Useful if for some reason you want to modify the dicts that are created by AA2conlab before using them here.

    • Type list

    Only works if tops is not None. Zero-indexed with residue indices of the tops and valued with consensus labels. Useful if for some reason top2labels doesn’t work automatically on the tops, since top2labels sometimes fails if it can’t cleanly align the consensus sequences to the residues in the tops.

  • tops (dict or None, default is None) – A dictionary of Topology objects, which allows expressing the consensus alignment also in terms of residue indices and of Cɑ-atom indices, using residxs and CAidxs, respectively (otherwise these attributes return None). If tops is present, self.keys will be in the same order as they appear in tops.
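To make the dict form of the maps values concrete, here is a minimal sketch of hand-written maps keyed with AAresSeq and valued with consensus labels, and of how residues line up once the maps are read by label. The helper align_by_label is hypothetical and only illustrates the idea; the few entries are taken from the examples on this page, whereas a real map covers the whole receptor.

```python
# Hand-written maps in the documented dict form: AAresSeq -> consensus label.
maps = {
    "OPS": {"R135": "3.50x50", "Y136": "3.51x51"},
    "B2AR": {"Q26": "1.25x25", "R131": "3.50x50", "Y132": "3.51x51"},
}


def align_by_label(maps):
    """Hypothetical helper: invert each map and line residues up by shared label."""
    # Invert every map so that it reads consensus label -> AAresSeq
    inverted = {key: {lab: res for res, lab in m.items()} for key, m in maps.items()}
    # Plain lexicographic sort, which is enough for this toy case
    labels = sorted({lab for inv in inverted.values() for lab in inv})
    # One row per label, one entry per system; None marks a missing residue
    return {lab: [inverted[key].get(lab) for key in maps] for lab in labels}


table = align_by_label(maps)
for lab, residues in table.items():
    print(lab, residues)
```

No sequence alignment is computed at any point: two residues end up in the same row only because they carry the same consensus label.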

Methods

AAresSeq_match([patterns, keys, ...])

Filter the self.AAresSeq by rows and columns.

CAidxs_match([patterns, keys, omit_missing, ...])

Filter the self.CAidxs by rows and columns.

__init__(maps[, tops])

residxs_match([patterns, keys, ...])

Filter the self.residxs by rows and columns.

sequence_match([patterns, absolute])

Matrix with the percentage of sequence identity within the set of the residues sharing consensus labels

Attributes

AAresSeq

Consensus-label alignment expressed as residues

CAidxs

Consensus-label alignment expressed as atom indices of the Cɑ atoms of respective tops

keys

The keys with which the maps and the tops were given at input

maps

The dictionaries mapping residue names to consensus labels.

residxs

Consensus-label alignment expressed as zero-indexed residue indices of the respective tops

tops

The topologies given at input

property AAresSeq: DataFrame

Consensus-label alignment expressed as residues

Will have NaNs where residues weren’t found, i.e. a given map didn’t contain that consensus label

Returns:

df

Return type:

DataFrame

AAresSeq_match(patterns=None, keys=None, omit_missing=True, select_keys=False) DataFrame

Filter the self.AAresSeq by rows and columns.

You can filter by consensus label using patterns and by system using keys.

By default, rows containing None or NaN values are excluded.

Parameters:
  • patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.

  • keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.

  • select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.

  • omit_missing (bool, default is True) – Omit rows with missing values.

Returns:

df

Return type:

DataFrame
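The patterns syntax can be illustrated with Python's fnmatch module, which implements Unix filename pattern matching. This is a sketch under the assumption that a leading “-” marks an exclusion pattern, as in the example above; match_labels is a hypothetical helper and mdciao's own matching may differ in details.

```python
from fnmatch import fnmatchcase


def match_labels(labels, patterns):
    """Hypothetical helper: filter labels with a CSV string of Unix-style patterns."""
    items = [p.strip() for p in patterns.split(",") if p.strip()]
    # Patterns with a leading "-" exclude, all others include
    include = [p for p in items if not p.startswith("-")]
    exclude = [p[1:] for p in items if p.startswith("-")]
    return [
        lab
        for lab in labels
        if any(fnmatchcase(lab, p) for p in include)
        and not any(fnmatchcase(lab, p) for p in exclude)
    ]


labels = ["2.50x50", "3.49x49", "3.50x50", "3.51x51", "3.60x60"]
print(match_labels(labels, "3.*,-3.5*"))  # ['3.49x49', '3.60x60']
```

Note that in this pattern syntax the dot is a literal character and only “*” acts as a wildcard, so “3.5*” covers 3.50 through 3.59 but not 3.49 or 3.60.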

property CAidxs: DataFrame

Consensus-label alignment expressed as atom indices of the Cɑ atoms of respective tops

Will have NaNs where atoms weren’t found, i.e. a given map didn’t contain that consensus label

Returns None if no tops were given at input.

Returns:

df

Return type:

DataFrame

CAidxs_match(patterns=None, keys=None, omit_missing=True, select_keys=False) DataFrame

Filter the self.CAidxs by rows and columns.

You can filter by consensus label using patterns and by system using keys.

By default, rows containing None or NaN values are excluded.

Parameters:
  • patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59

    • “G.S*” will include all beta-sheets

  • keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.

  • select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.

  • omit_missing (bool, default is True) – Omit rows with missing values.

Returns:

df

Return type:

DataFrame

property keys: list

The keys with which the maps and the tops were given at input

property maps: dict

The dictionaries mapping residue names to consensus labels.

property residxs: DataFrame

Consensus-label alignment expressed as zero-indexed residue indices of the respective tops

Will have NaNs where residues weren’t found, i.e. a given map didn’t contain that consensus label

Returns None if no tops were given at input.

Returns:

df

Return type:

DataFrame

residxs_match(patterns=None, keys=None, omit_missing=True, select_keys=False) DataFrame

Filter the self.residxs by rows and columns.

You can filter by consensus label using patterns and by system using keys.

By default, rows containing None or NaN values are excluded.

Parameters:
  • patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.

  • keys (list, default is None) – If only a sub-set of columns need to match, provide them here as list of strings. If None, all columns will be used.

  • select_keys (bool, default is False) – Use the patterns not only to select for rows but also to select for columns, i.e. for keys. Keys (=columns) not featuring any patterns will be dropped.

  • omit_missing (bool, default is True) – Omit rows with missing values.

Returns:

df

Return type:

DataFrame

sequence_match(patterns=None, absolute=False) DataFrame

Matrix with the percentage of sequence identity within the set of the residues sharing consensus labels

The comparison is done between the reference consensus sequences in self.AAresSeq, i.e., independently of any tops that the user has provided.

Example:

>>> AC.sequence_match(patterns="3.5*")
       OPS  B2AR  MUOR
OPS   100%   29%   57%
B2AR   29%  100%   43%
MUOR   57%   43%  100%

Meaning, for the residues having consensus labels 3.50 to 3.59, B2AR and OPS have 29% identity, and OPS and MUOR 57%. You can express this as the absolute number of residues:

>>> AC.sequence_match("3.5*", absolute=True)
      OPS  B2AR  MUOR
OPS     7     2     4
B2AR    2     7     3
MUOR    4     3     7

You can check which residues these are:

>>> AC.AAresSeq_match("3.5*")
    consensus   OPS  B2AR  MUOR
117   3.50x50  R135  R131  R165
118   3.51x51  Y136  Y132  Y166
119   3.52x52  V137  F133  I167
120   3.53x53  V138  A134  A168
121   3.54x54  V139  I135  V169
122   3.55x55  C140  T136  C170
123   3.56x56  K141  S137  H171

You can see the two OPS/B2AR matches in 3.50x50 and 3.51x51.

Parameters:
  • patterns (str, default is None) – A list in CSV-format of patterns to be matched by the consensus labels. Matches are done using Unix filename pattern matching and allow for exclusion, e.g. “3.*,-3.5*” will include all residues in TM3 except those in the segment 3.50…3.59.

  • absolute (bool, default is False) – Instead of returning a percentage, return the number of matching residues as integers.

Returns:

match – The matrix of sequence identity, for residues sharing consensus labels across the different systems.

Return type:

DataFrame
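The numbers in the example above can be reproduced by hand. The sketch below assumes that identity is counted over the positions where both systems have a residue and that percentages are rounded to the nearest integer; identity is a hypothetical helper, not the method's actual code, and the one-letter residue codes are copied from the 3.5* rows of the example.

```python
# One-letter residue codes for labels 3.50x50 ... 3.56x56, from the example above.
residues = {
    "OPS": ["R", "Y", "V", "V", "V", "C", "K"],
    "B2AR": ["R", "Y", "F", "A", "I", "T", "S"],
    "MUOR": ["R", "Y", "I", "A", "V", "C", "H"],
}


def identity(a, b, absolute=False):
    """Hypothetical helper: identity over positions present in both sequences."""
    # Assumption: positions missing (None) in either system are not counted
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n_equal = sum(x == y for x, y in shared)
    if absolute:
        return n_equal
    return round(100 * n_equal / len(shared)) if shared else 0


matrix = {
    k1: {k2: identity(s1, s2) for k2, s2 in residues.items()}
    for k1, s1 in residues.items()
}
for key, row in matrix.items():
    print(key, row)
```

With these seven positions, OPS and B2AR share 2 residues (29%), OPS and MUOR 4 (57%), and B2AR and MUOR 3 (43%), in agreement with the matrices shown in the example.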

property tops: dict

The topologies given at input