mdciao.nomenclature.LabelerGPCR

class mdciao.nomenclature.LabelerGPCR(*args, **kwargs)

Obtain and manipulate GPCR generic residue numbering provided by the GPCRdb.

This is based on the awesome GPCRdb REST-API.

The generic residue labels are called indistinctly “consensus labels”, “generic residue numbers” or simply “nomenclature” throughout mdciao.

The GPCRdb provides both TM-domain and GAIN-domain labels for adhesion GPCRs (aGPCRs, see below).

The GPCRdb offers different schemes for GPCR labels:

  • structure based schemes, available for all GPCR classes. You can use them by instantiating with “GPCRdb(A)”, “GPCRdb(B)” etc. The relevant references are:

  • sequence based schemes, with different names depending on the GPCR class

  • Class-A: Ballesteros et al, (1995) Integrated methods for the construction of three-dimensional models and computational probing of structure-function relations in G protein-coupled receptors, Methods in Neurosciences 25, 366–428, https://doi.org/10.1016/S1043-9471(05)80049-7

  • Class-B: Wootten et al, (2013) Polar transmembrane interactions drive formation of ligand-specific and signal pathway-biased family B G protein-coupled receptor conformations, Proceedings of the National Academy of Sciences of the United States of America 110, 5211–5216, https://doi.org/10.1073/pnas.1221585110

  • Class-C: Pin et al, (2003) Evolution, structure, and activation mechanism of family 3/C G-protein-coupled receptors Pharmacology and Therapeutics 98, 325–354 https://doi.org/10.1016/S0163-7258(03)00038-X

  • Class-F: Wu et al, (2014) Structure of a class C GPCR metabotropic glutamate receptor 1 bound to an allosteric modulator Science 344, 58–64 https://doi.org/10.1126/science.1249489

  • structure based scheme for the GAIN domain (GPCR autoproteolysis inducing) of adhesion GPCRs (aGPCRs, class B2):

Note

Not all schemes might work for all methods of this class. In particular, fragmentation heuristics are derived from 3.50(x50)-type formats. Other class A schemes like Oliveira and Baldwin-Schwarz can be used for residue mapping, labeling, but not for fragmentation. They are still partially usable but we have decided to omit them from the docs. Please see the full reference page for their citation.

__init__(*args, **kwargs)
Parameters:
  • UniProt_name (str) – Descriptor by which to find the generic residue labels, it gets directly passed to GPCRdb_finder, which operates in the following order:

    • First, try UniProt_name, as path to an existing file on disk. This is done ignoring the values of local_path and format. Supported formats are are spreadsheet (‘adrb2_human.xlsx’), or pandas binary (‘adrb2_human.pkl’). Please note that loading pickled data from untrusted sources can be unsafe. See here.

    • If the above step doesn’t find anything locally, then merge the local_path and the UniProt_name using the format. E.g. if UniProt_name=’adrb2_human’ and local_path=’my_nomenclature_files then the method looks for ‘my_nomenclature_files/adrb2_human.xlsx’

    • If none of these files can be found then ‘adrb2_human’ is looked up online in the GPCRdb, unless try_web_lookup is False.

    Note

    Please note the difference between UniProt Accession Code and UniProt entry name as explained here.

  • scheme (str, default is ‘display_generic_number’) – The nomenclature scheme to use. Has no effect for G-proteins. It is only important for GPCRs, where the default is to use what the GPCRdb itself has chosen for this particular GPCR. Not all schemes will be available for all choices of GPCRs. You can choose from: ‘BW’, ‘Wootten’, ‘Pin’, ‘Wang’, ‘Fungal’, ‘GPCRdb(A)’, ‘GPCRdb(B)’, ‘GPCRdb(C)’, ‘GPCRdb(F)’, ‘GPCRdb(D)’, ‘Oliveira’, ‘BS’, but not all are guaranteed to work

  • local_path (str, default is “.”) – Since the UniProt_name is turned into a filename, this is the local path where to (potentially) look for files. In case UniProt_name is just a filename, we can turn it into a full path to a local file using this parameter, which is passed to GPCRdb_finder.

  • format (str, default is “%s.xlsx”) – How to construct a filename out of UniProt_name. Alternative is “%s.pkl” for pandas pickle format. Please note that loading pickled data from untrusted sources can be unsafe. See here.

  • verbose (bool, default is True) – Be verbose. Gets passed to GPCRdb_finder

  • try_web_lookup (bool, default is True) – Try a web lookup on the GPCRdb of the UniProt_name. If UniProt_name is e.g. “adrb2_human.xlsx”, including the extension “xslx”, then the lookup will fail. This what the format parameter is for

  • write_to_disk (bool, default is False) – Save an Excel file with the nomenclature information

  • GPS_cleaved (bool, default is False) – If True, split the GPS-fragment into two fragments between GPS.-1 and GPS+1, s.t. they get treated as different fragments: [“GPS.-2”, “GPS.-1”] and the cleaved GPS [“GPSc.+1”]. This resembles better the situation in which the last part of the GAIN domain (GPS.+1 and then S14) have been actually cleaved

Methods

__init__(*args, **kwargs)

Parameters:
  • UniProt_name (str) -- Descriptor by which to find the generic residue labels,

aligntop(top[, restrict_to_residxs, ...])

Align a topology with the object's sequence.

conlab2residx(top[, map])

Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.

top2domains(top, **top2frags_kwargs)

Like top2frags but for separating a GPCR into GAIN and TM domains.

top2frags(top[, fragments, min_seqID_rate, ...])

Return the subdomains derived from the consensus nomenclature and map it out in terms of residue indices of the input top

top2labels(top[, allow_nonmatch, ...])

Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.

Attributes

AA2conlab

Dictionary with short AA-codes as keys, so that e.g.

UniProt_name

conlab2AA

Dictionary with consensus labels as keys, so that e.g.

conlab2idx

Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe, as values so that e.g.

dataframe

DataFrame summarizing this object's information

fragment_names

Name of the fragments according to the consensus labels

fragments

Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.

fragments_as_conlabs

Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment

fragments_as_idxs

Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not

fragments_as_resSeqs

Dictionary of fragments keyed with fragment names and valued with the residue sequence indices (resSeq) in that fragment

idx2conlab

List of consensus labels in the order (=idx) they appear in the original dataframe

most_recent_alignment

A DataFrame with the most recent alignment

most_recent_top2labels

The most recent self.top2labels-result

seq

The reference sequence in dataframe

tablefile

The file used to instantiate this transformer

property AA2conlab

Dictionary with short AA-codes as keys, so that e.g. * self.AA2conlab[“R131”] -> ‘3.50’ * self.AA2conlab[“R201”] -> “G.hfs2.2”

aligntop(top, restrict_to_residxs=None, min_seqID_rate=0.5, fragments='resSeq', verbose=False)

Align a topology with the object’s sequence. Returns two maps (top2self, self2top) and populates the attribute self.most_recent_alignment

Wraps around mdciao.utils.sequence.align_tops_or_seqs

The indices of self are indices (row-indices) of the original dataframe, which are the ones in seq

Parameters:
  • top (Topology object or string)

  • restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residues indices defining the topology segment wherein the match should be contained to.

  • min_seqID_rate (float, default .5) – With big topologies and many fragments, the alignment method (mdciao.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate >0, e.g. .5 means that a pre-alignment takes place to populate restrict_to_residxs with indices of those the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment. If min_seqID_rate>0, :obj`restrict_to_residx` has to be None.

  • fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:

    XXL XX      XX LXX
    ||| ||  vs  || |||
    XXLLXX      XXLLXX
    

    In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.

    Note

    fragments only has an effect if both

    • the top is an actual Topology carrying the sequence

    indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.

    • two or more alignments share the optimal alignment score

    The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use `fragments=False`. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign , regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.

  • verbose (boolean, default is False) – be verbose

Returns:

  • top2self (dict) – Maps indices of top to indices of this objects self.seq

  • self2top (dict) – Maps indices of this object’s seq.seq to indices of this self.seq

property conlab2AA

Dictionary with consensus labels as keys, so that e.g. * self.conlab2AA[“3.50”] -> ‘R131’ or * self.conlab2AA[“G.hfs2.2”] -> ‘R201’

property conlab2idx

Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe, as values so that e.g. * self.conlab2AA[“3.50”] -> ‘R131’ or * self.conlab2AA[“G.hfs2.2”] -> ‘R201’

conlab2residx(top, map=None, **top2labels_kwargs)

Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.

The default behaviour is to internally align top with the object’s available consensus dictionary on the fly using top2labels. See the docs there for **top2labels_kwargs, in particular restrict_to_residxs, keep_consensus, and min_seqID_rate

Note

This method is able to work with a new topology every time, performing a sequence alignment every call. The intention is to instantiate a LabelerConsensus just one time and use it with as many topologies as you like without changing any attribute of self.

HOWEVER, if you know what you are doing, you can provide a list of consensus labels yourself using map. Then, this method is nothing but a table lookup (almost)

Warning

No checks are performed to see if the input of map actually matches the residues of top in any way, so that the output can be rubbish and go unnoticed.

Parameters:
  • top (Topology)

  • map (list, default is None) – A pre-computed residx2consensuslabel map, i.e. the output of a previous, external call to _top2consensus_map If it contains duplicates, it is a malformed list. See the note above for more info

  • top2labels_kwargs (dict) – Optional parameters for top2labels, which are listed below

Other Parameters:
  • allow_nonmatch (bool, default is True) – Use consensus labels for non-matching positions in case the non-matches have equal lengths

  • autofill_consensus (boolean default is False) – Even if there is a consensus mismatch with the sequence of the input AA2conlab_dict, try to relabel automagically, s.t.

    • [‘G.H5.25’, ‘G.H5.26’, None, ‘G.H.28’]

    will be relabeled as
    • [‘G.H5.25’, ‘G.H5.26’, ‘G.H.27’, ‘G.H.28’]

  • min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate >0, e.g. .5 means that a pre-alignment takes place to populate restrict_to_residxs with indices of those the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment.

  • restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residues indices defining the topology segment wherein the match should be contained to.

  • fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:

    XXL XX      XX LXX
    ||| ||  vs  || |||
    XXLLXX      XXLLXX
    

    In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.

    Note

    fragments only has an effect if both

    • the top is an actual Topology carrying the sequence

    indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.

    • two or more alignments share the optimal alignment score

    The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use `fragments=False`. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign , regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.

  • verbose (boolean, default is False) – be verbose

Returns:

dict

Return type:

keyed by consensus labels and valued with residue idxs

property dataframe: DataFrame

DataFrame summarizing this object’s information

Returns:

df

Return type:

DataFrame

property fragment_names

Name of the fragments according to the consensus labels

TODO OR NOT? Check!

property fragments

Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.

property fragments_as_conlabs

Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment

property fragments_as_idxs

Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not

property fragments_as_resSeqs: dict

Dictionary of fragments keyed with fragment names and valued with the residue sequence indices (resSeq) in that fragment

Returns:

fragments_as_resSeqs

Return type:

dict

property idx2conlab

List of consensus labels in the order (=idx) they appear in the original dataframe

This index is the row-index of the table, don’t count on it being aligned with anything

property most_recent_alignment: DataFrame

A DataFrame with the most recent alignment

Expert use only

Returns:

df

Return type:

DataFrame

property most_recent_top2labels

The most recent self.top2labels-result

Expert use only

Returns:

df

Return type:

list

property seq

The reference sequence in dataframe

property tablefile

The file used to instantiate this transformer

top2domains(top, **top2frags_kwargs)

Like top2frags but for separating a GPCR into GAIN and TM domains.

As in top2frags, regions of the topology withoug generic residue labels (e.g. ligands or other components) are left out of the returned domains.

Parameters:
  • topTopology or path to topology file (e.g. a pdb)

  • fragments (iterable of integers, default is None) – Any useful fragment definition as lists of residue indices. Useful means:

    • Help with the alignment needed for consensus fragment definition. Look at LabelerConsensus.aligntop and its fragments and min_seqID_rate parameters.

    • Check if the newly found, consensus fragment definitions (defs) clash with the input in fragments. Clash* means that the defs would span over more than one of the fragments in defined in fragments.

    An interactive prompt will ask the user which fragments to keep in case of clashes.

    Check the method check_if_fragment_clashes for more info.

  • min_seqID_rate (float, default is .5) – With big topologies, like a receptor-Gprotein system, the “brute-force” alignment method (check mdciao.sequence.my_bioalign) sometimes yields sub-optimal results, e.g. finding short snippets of reference sequence that align in a completely wrong part of the topology. To avoid this, an initial, exploratory alignment is carried out. min_seqID_rate = .5 means that only the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in this exploration are used to improve the second alignment

  • input_dataframe (DataFrame, default is None) – Expert option, use at your own risk. Instead of aligning top to the object’s sequence to derive fragment definitions, input an existing alignment here, e.g. the self.most_recent_alignment

  • show_alignment (bool, default is False,) – Show the entire alignment as DataFrame

  • atoms (bool, default is False) – Instead of returning residue indices, return atom indices

  • verbose (bool, default is True) – Also print the definitions

Returns:

domains – A list of two lists, each one containing the residue indices of the GAIN domain and of the TM domain, in that order. Empty list(s) represent missing domains.

Return type:

list

top2frags(top, fragments=None, min_seqID_rate=0.5, input_dataframe=None, show_alignment=False, atoms=False, verbose=True) dict

Return the subdomains derived from the consensus nomenclature and map it out in terms of residue indices of the input top

Note

This method uses aligntop internally, see the doc on that method for more info.

Parameters:
  • topTopology or path to topology file (e.g. a pdb)

  • fragments (iterable of integers, default is None) – Any useful fragment definition as lists of residue indices. Useful means:

    • Help with the alignment needed for consensus fragment definition. Look at LabelerConsensus.aligntop and its fragments and min_seqID_rate parameters.

    • Check if the newly found, consensus fragment definitions (defs) clash with the input in fragments. Clash* means that the defs would span over more than one of the fragments in defined in fragments.

    An interactive prompt will ask the user which fragments to keep in case of clashes.

    Check the method check_if_fragment_clashes for more info.

  • min_seqID_rate (float, default is .5) – With big topologies, like a receptor-Gprotein system, the “brute-force” alignment method (check mdciao.sequence.my_bioalign) sometimes yields sub-optimal results, e.g. finding short snippets of reference sequence that align in a completely wrong part of the topology. To avoid this, an initial, exploratory alignment is carried out. min_seqID_rate = .5 means that only the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in this exploration are used to improve the second alignment

  • input_dataframe (DataFrame, default is None) – Expert option, use at your own risk. Instead of aligning top to the object’s sequence to derive fragment definitions, input an existing alignment here, e.g. the self.most_recent_alignment

  • show_alignment (bool, default is False,) – Show the entire alignment as DataFrame

  • atoms (bool, default is False) – Instead of returning residue indices, return atom indices

  • verbose (bool, default is True) – Also print the definitions

Returns:

defs – Dictionary with subdomain names as keys and arrays of indices (residue or atom) as values

Return type:

dictionary

top2labels(top, allow_nonmatch=True, autofill_consensus=True, min_seqID_rate=0.5, **aligntop_kwargs) list

Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.

Populates the attributes most_recent_top2labels and most_recent_alignment

If a consensus label is returned as None it means one of two things:

  • this position was successfully aligned with a match but the data used to initialize this ConsensusLabeler did not contain a label

  • this position has a label in the original data but the sequence alignment is not matched (e.g., bc of a point mutation)

To remedy the second case a-posteriori two things can be done:

  • recover the original label even though residues did not match, using allow_nonmatch. See alignment_df2_conslist for more info

  • reconstruct what the label could be using a heuristic to “autofill” the consensus labels, using autofill_consensus. See _fill_consensus_gaps for more info.

Note

This method uses aligntop internally, see the doc on that method for more info.

Parameters:
  • top (Topology object or str) – The topology as an object or a path to a filename, e.g. a pdb file.

  • allow_nonmatch (bool, default is True) – Use consensus labels for non-matching positions in case the non-matches have equal lengths

  • autofill_consensus (boolean default is False) – Even if there is a consensus mismatch with the sequence of the input AA2conlab_dict, try to relabel automagically, s.t.

    • [‘G.H5.25’, ‘G.H5.26’, None, ‘G.H.28’]

    will be relabeled as
    • [‘G.H5.25’, ‘G.H5.26’, ‘G.H.27’, ‘G.H.28’]

  • min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate >0, e.g. .5 means that a pre-alignment takes place to populate restrict_to_residxs with indices of those the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment.

  • aligntop_kwargs (dict) – Optional parameters for aligntop, which are listed below

Other Parameters:
  • restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residues indices defining the topology segment wherein the match should be contained to.

  • min_seqID_rate (float, default .5) – With big topologies and many fragments, the alignment method (mdciao.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate >0, e.g. .5 means that a pre-alignment takes place to populate restrict_to_residxs with indices of those the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment. If min_seqID_rate>0, :obj`restrict_to_residx` has to be None.

  • fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:

    XXL XX      XX LXX
    ||| ||  vs  || |||
    XXLLXX      XXLLXX
    

    In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.

    Note

    fragments only has an effect if both

    • the top is an actual Topology carrying the sequence

    indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.

    • two or more alignments share the optimal alignment score

    The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use `fragments=False`. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign , regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.

  • verbose (boolean, default is False) – be verbose

Returns:

map – List of len = top.n_residues with the consensus labels

Return type:

list