mdciao.nomenclature.LabelerCGN
-
class
mdciao.nomenclature.
LabelerCGN
(PDB_input, local_path='.', try_web_lookup=True, verbose=True, write_to_disk=None)
Obtain and manipulate the common G-protein nomenclature. See https://www.mrc-lmb.cam.ac.uk/CGN/faq.html for more info.
-
__init__
(PDB_input, local_path='.', try_web_lookup=True, verbose=True, write_to_disk=None)
- Parameters
PDB_input (str) –
The PDB-file to be used. For compatibility reasons, there are three different use cases:
- Full path to an existing file containing the CGN nomenclature, e.g. ‘/abs/path/to/some/dir/CGN_ABCD.txt’ (or ABCD.txt). Then this happens:
  - local_path gets overridden with ‘/abs/path/to/some/dir/’
  - a PDB four-letter code is inferred from the filename, e.g. ‘ABCD’
  - a file ‘/abs/path/to/some/dir/ABCD.pdb(.gz)’ is looked for
  - if not found and try_web_lookup is True, then ‘ABCD’ is looked up online in the RCSB PDB database
- Full path to an existing PDB-file, e.g. ‘/abs/path/to/some/dir/ABCD.pdb(.gz)’. Then this happens:
  - local_path gets overridden with ‘/abs/path/to/some/dir/’
  - a file ‘/abs/path/to/some/dir/CGN_ABCD.txt’ is looked for
  - if not found and try_web_lookup is True, then ‘ABCD’ is looked up online in the CGN database
- Four-letter code, e.g. ‘ABCD’. Then this happens:
  - the files ‘ABCD.pdb’ and ‘CGN_ABCD.txt’ are looked for in local_path
  - if one or both of these files cannot be found there, they are looked up in their respective online databases (if try_web_lookup is True)
Note
The intention behind this flexibility (which is hard to document and maintain) is to keep the signature of consensus labelers somewhat consistent for compatibility with other command line methods
- local_path: str, default is ‘.’
The local path where these files exist, if they exist
- try_web_lookup: bool, default is True
If the local files are not found, automatically try a web lookup at
- www.mrc-lmb.cam.ac.uk (for CGN)
- rcsb.org (for the PDB)
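The three use cases of PDB_input can be illustrated with a small, standalone sketch. The helper name resolve_input and its exact rules are assumptions made for illustration only; this is not mdciao's code, and it performs no file or web lookups. It only shows how a four-letter code and a search directory could be derived from each kind of input.

```python
import os.path


def resolve_input(PDB_input, local_path="."):
    """Hypothetical helper: return (four_letter_code, directory) for the three use cases."""
    directory, filename = os.path.split(PDB_input)
    if not directory and "." not in filename:
        # Use case 3: a bare four-letter code, files are searched in local_path
        return filename, local_path
    # Use cases 1 and 2: strip known extensions and the 'CGN_' prefix
    stem = filename
    for extension in (".gz", ".pdb", ".txt"):
        if stem.endswith(extension):
            stem = stem[: -len(extension)]
    if stem.startswith("CGN_"):
        stem = stem[len("CGN_"):]
    # A directory contained in the input overrides local_path
    return stem, directory or local_path


for example in ("/abs/path/to/some/dir/CGN_ABCD.txt",
                "/abs/path/to/some/dir/ABCD.pdb.gz",
                "ABCD"):
    print(example, "->", resolve_input(example))
```

In all three cases the sketch ends up with the same code, 'ABCD', which is the point of the flexible signature: whichever file is given, the name of its companion file can be derived.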
Methods
__init__(PDB_input[, local_path, …])
PDB_input: the PDB-file to be used. For compatibility reasons, there are different use cases.
aligntop(top[, restrict_to_residxs, …])
Align a topology with the object’s sequence.
conlab2residx(top[, map])
Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.
top2frags(top[, fragments, min_hit_rate, …])
Return the subdomains derived from the consensus nomenclature and map them out in terms of residue indices of the input top.
top2labels(top[, allow_nonmatch, …])
Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.
Attributes
AA2conlab
Dictionary with short AA-codes as keys.
conlab2AA
Dictionary with consensus labels as keys.
conlab2idx
Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe as values.
dataframe
DataFrame summarizing this object’s information.
fragment_names
Name of the fragments according to the consensus labels.
fragments
Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.
fragments_as_conlabs
Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment.
fragments_as_idxs
Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not.
geom
Trajectory with what was found (locally or online) using ref_PDB.
idx2conlab
List of consensus labels in the order (=idx) they appear in the original dataframe.
most_recent_alignment
A DataFrame with the most recent alignment.
most_recent_top2labels
The most recent self.top2labels result.
ref_PDB
PDB code used for instantiation.
seq
The reference sequence in dataframe.
tablefile
The file used to instantiate this transformer.
top
Topology with what was found (locally or online) using ref_PDB.
-
property
AA2conlab
Dictionary with short AA-codes as keys, so that e.g.
- self.AA2conlab["R131"] -> '3.50'
- self.AA2conlab["R201"] -> 'G.hfs2.2'
-
aligntop
(top, restrict_to_residxs=None, min_hit_rate=0.5, fragments='resSeq', verbose=False)
Align a topology with the object’s sequence. Returns two maps (top2self, self2top) and populates the attribute self.most_recent_alignment.
Wraps around mdciao.utils.sequence.align_tops_or_seqs.
The indices of self are the row-indices of the original dataframe, which are the ones in seq.
- Parameters
top (Topology object or string) –
restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residue indices defining the topology segment that should contain the match.
min_hit_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_hit_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment. If min_hit_rate > 0, restrict_to_residxs has to be None.
fragments (str, iterable, None, or bool, default is 'resSeq') –
Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:
    XXL XX      XX LXX
    ||| ||  vs  || |||
    XXLLXX      XXLLXX
In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as an iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.
Note
fragments only has an effect if both:
- the top is an actual Topology carrying the sequence indices, since if top is a sequence string, then there’s no fragmentation heuristic possible
- two or more alignments share the optimal alignment score
The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation for other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use fragments=False. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign, regardless of there being other, equally scored alignments and potential clashes with sensitive fragmentations.
verbose (boolean, default is False) – be verbose
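The tie-breaking idea described for fragments can be sketched in a few lines. This is an illustration under stated assumptions, not mdciao's implementation: alignments are represented here as gapped strings of the input sequence (with '-' for a gap), fragments as lists of residue indices, and the helper name respects_fragments is hypothetical. An alignment is taken to respect the fragmentation if no gap opens between two residues of the same fragment.

```python
def respects_fragments(gapped_seq, fragments):
    """Return True if every gap in gapped_seq falls between two different fragments."""
    # Map each residue index of the input sequence to the index of its fragment
    frag_of = {residx: ii for ii, frag in enumerate(fragments) for residx in frag}
    residx = -1
    gap_open = False
    for char in gapped_seq:
        if char == "-":
            gap_open = True
            continue
        residx += 1
        # A gap between two residues of the same fragment breaks that fragment
        if gap_open and residx > 0 and frag_of[residx - 1] == frag_of[residx]:
            return False
        gap_open = False
    return True


# Input 'XXLXX' fragmented into [XX], [LXX]; both candidates align to 'XXLLXX'
fragments = [[0, 1], [2, 3, 4]]
candidates = ["XXL-XX", "XX-LXX"]
chosen = [aln for aln in candidates if respects_fragments(aln, fragments)]
print(chosen)
```

Only the second candidate survives, matching the choice described in the text for the [XX],[LXX] fragmentation.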
- Returns
top2self (dict) – Maps indices of top to indices of this object’s self.seq
self2top (dict) – Maps indices of this object’s self.seq to indices of top
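To make the two returned maps concrete, here is a minimal sketch of how a pair of index maps can be derived from a pairwise alignment. It assumes the alignment is available as two gapped strings of equal length and that only matching positions enter the maps; the helper name alignment_to_maps is hypothetical and the actual behaviour of aligntop may differ in such details.

```python
def alignment_to_maps(top_aligned, self_aligned):
    """Build (top2self, self2top) index maps from two gapped, equally long strings."""
    top2self, self2top = {}, {}
    i_top = i_self = 0
    for char_top, char_self in zip(top_aligned, self_aligned):
        # Only positions where both sequences carry the same residue are mapped
        if char_top != "-" and char_top == char_self:
            top2self[i_top] = i_self
            self2top[i_self] = i_top
        # Advance each index only when its sequence has a residue at this position
        i_top += char_top != "-"
        i_self += char_self != "-"
    return top2self, self2top


top2self, self2top = alignment_to_maps("XX-LXX", "XXLLXX")
print(top2self)
print(self2top)
```

Residue 2 of the reference sequence has no partner in the input, so it does not appear as a key of self2top.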
-
property
conlab2AA
Dictionary with consensus labels as keys, so that e.g.
- self.conlab2AA["3.50"] -> 'R131'
- self.conlab2AA["G.hfs2.2"] -> 'R201'
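Going by the documented examples, AA2conlab and conlab2AA behave like inverse mappings of each other. The following sketch uses plain dictionaries with the values quoted above; how the library builds these properties internally, and whether unlabeled residues are stored with a None value, are assumptions.

```python
# Values copied from the documented examples; a real labeler holds many more entries
AA2conlab = {"R131": "3.50", "R201": "G.hfs2.2"}

# Assumption: invert the mapping, skipping residues without a consensus label (None)
conlab2AA = {conlab: AA for AA, conlab in AA2conlab.items() if conlab is not None}

print(conlab2AA["3.50"])
print(conlab2AA["G.hfs2.2"])
```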
-
property
conlab2idx
Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe as values, so that e.g. self.conlab2idx["3.50"] is the row-index of 'R131' in self.dataframe.
-
conlab2residx
(top, map=None, **top2labels_kwargs)
Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.
The default behaviour is to internally align top with the object’s available consensus dictionary on the fly using self.top2labels. See the docs there for **top2labels_kwargs, in particular restrict_to_residxs, keep_consensus, and min_hit_rate.
Note
This method is able to work with a new topology every time, performing a sequence alignment on every call. The intention is to instantiate a LabelerConsensus just once and use it with as many topologies as you like without changing any attribute of self.
However, if you know what you are doing, you can provide a list of consensus labels yourself using map. Then, this method is nothing but a table lookup (almost).
Warning
No checks are performed to see if the input of map actually matches the residues of top in any way, so the output can be rubbish and go unnoticed.
- Parameters
top (Topology) –
map (list, default is None) – A pre-computed residx2consensuslabel map, i.e. the output of a previous, external call to _top2consensus_map. If it contains duplicates, it is a malformed list. See the note above for more info.
- Returns
Dictionary keyed by consensus labels and valued with residue idxs
- Return type
dict
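When map is supplied, the method is described as little more than a table lookup. A minimal sketch of such a lookup, including the duplicate check that the map parameter implies, could look as follows. The helper name labels_to_residx is hypothetical, and raising a ValueError on duplicates is an assumption about how a malformed list might be handled.

```python
def labels_to_residx(labels):
    """Invert a per-residue list of consensus labels into {label: residue index}."""
    conlab2residx = {}
    for residx, label in enumerate(labels):
        if label is None:
            # Residues without a consensus label are skipped
            continue
        if label in conlab2residx:
            # A label appearing twice means the input list is malformed
            raise ValueError(f"Duplicate consensus label {label!r} at index {residx}")
        conlab2residx[label] = residx
    return conlab2residx


# One entry per residue of a (hypothetical) topology with four residues
per_residue_labels = [None, "G.H5.25", "G.H5.26", None]
print(labels_to_residx(per_residue_labels))
```

As the warning above states, nothing in such a lookup can verify that the list belongs to the topology it is used with.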
-
property
fragment_names
Name of the fragments according to the consensus labels
-
property
fragments
Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.
-
property
fragments_as_conlabs
Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment
-
property
fragments_as_idxs
Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not
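The relation between consensus labels and fragment names can be pictured with a short sketch. It assumes that a fragment name is the part of a label before its last dot (so 'G.H5.26' belongs to 'G.H5'); this naming rule and the helper group_by_fragment are illustrative assumptions, not a statement about how the library derives its fragments.

```python
from collections import defaultdict


def group_by_fragment(labels):
    """Group consensus labels by the (assumed) fragment name before the last dot."""
    grouped = defaultdict(list)
    for label in labels:
        if label is None:
            # Residues without a consensus label belong to no named fragment here
            continue
        fragment_name = label.rpartition(".")[0]
        grouped[fragment_name].append(label)
    return dict(grouped)


labels = ["G.H5.25", "G.H5.26", None, "G.hfs2.2"]
print(group_by_fragment(labels))
```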
-
property
geom
Trajectory with what was found (locally or online) using ref_PDB
-
property
idx2conlab
List of consensus labels in the order (=idx) they appear in the original dataframe
This index is the row-index of the table, don’t count on it being aligned with anything
-
property
most_recent_alignment
A DataFrame with the most recent alignment. Expert use only.
- Returns
df
- Return type
DataFrame
-
property
most_recent_top2labels
The most recent self.top2labels result. Expert use only.
- Returns
df
- Return type
list
-
property
ref_PDB
PDB code used for instantiation
-
property
tablefile
The file used to instantiate this transformer
-
top2frags
(top, fragments=None, min_hit_rate=0.5, input_dataframe=None, show_alignment=False, verbose=True)
Return the subdomains derived from the consensus nomenclature and map them out in terms of residue indices of the input top.
Note
This method uses aligntop internally, see the doc on that method for more info.
- Parameters
top – Topology or path to topology file (e.g. a pdb)
fragments (iterable of integers, default is None) –
The user can pass an existing list of fragment definitions (via residue idxs) to check if the newly found consensus definitions (defs) clash with the input in fragments. Clash means that the defs would span more than one of the fragments defined in fragments.
An interactive prompt will ask the user which fragments to keep in case of clashes.
Check check_if_subfragment for more info.
min_hit_rate (float, default is .5) – With big topologies, like a receptor-Gprotein system, the “brute-force” alignment method (check mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results, e.g. finding short snippets of reference sequence that align in a completely wrong part of the topology. To avoid this, an initial, exploratory alignment is carried out. min_hit_rate = .5 means that only the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in this exploration are used to improve the second alignment.
input_dataframe (DataFrame, default is None) – Expert option, use at your own risk. Instead of aligning top to the object’s sequence to derive fragment definitions, input an existing alignment here, e.g. the self.most_recent_alignment.
show_alignment (bool, default is False) – Show the entire alignment as DataFrame
verbose (bool, default is True) – Also print the definitions
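The role of min_hit_rate can be sketched as a simple filter on the outcome of the exploratory alignment. The sketch assumes that the exploratory alignment yields the residue indices that matched and that fragments are lists of residue indices; the helper name and data layout are hypothetical and only illustrate the "more than 50% alignment" rule described above.

```python
def residxs_of_well_aligned_fragments(fragments, matched_residxs, min_hit_rate=0.5):
    """Collect residues of fragments whose fraction of matched residues exceeds min_hit_rate."""
    matched = set(matched_residxs)
    keep = []
    for frag in fragments:
        hit_rate = len(matched.intersection(frag)) / len(frag)
        # Strictly greater: a fragment at exactly min_hit_rate is left out
        if hit_rate > min_hit_rate:
            keep.extend(frag)
    return keep


fragments = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
matched_in_exploration = [0, 1, 2, 5, 8]
print(residxs_of_well_aligned_fragments(fragments, matched_in_exploration))
```

Only the first fragment (hit rate 0.75) passes; its residues would then restrict the second alignment.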
- Returns
defs – Dictionary with subdomain names as keys and lists of indices as values
- Return type
dictionary
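The notion of a clash between consensus definitions and user-provided fragments can be made concrete with a short sketch. It follows the definition given for the fragments parameter (a definition clashes if it spans more than one input fragment), but the helper name clashing_defs, the subdomain names, and the residue indices are illustrative assumptions; the interactive prompt is not reproduced.

```python
def clashing_defs(consensus_defs, fragments):
    """Return {name: [fragment indices]} for definitions spanning more than one fragment."""
    # Map each residue index to the index of the input fragment containing it
    frag_of = {residx: ii for ii, frag in enumerate(fragments) for residx in frag}
    clashes = {}
    for name, residxs in consensus_defs.items():
        spanned = sorted({frag_of[r] for r in residxs if r in frag_of})
        if len(spanned) > 1:
            clashes[name] = spanned
    return clashes


fragments = [[0, 1, 2, 3], [4, 5, 6, 7]]
# Illustrative subdomain names and residue indices, not taken from a real system
defs = {"G.H5": [1, 2, 3], "G.S1": [3, 4, 5]}
print(clashing_defs(defs, fragments))
```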
-
top2labels
(top, allow_nonmatch=True, autofill_consensus=True, min_hit_rate=0.5, **aligntop_kwargs)
Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.
Populates the attributes most_recent_top2labels and most_recent_alignment.
If a consensus label is returned as None, it means one of two things:
- this position was successfully aligned with a match but the data used to initialize this LabelerConsensus did not contain a label
- this position has a label in the original data but the sequence alignment is not matched (e.g., because of a point mutation)
To remedy the second case a posteriori, two things can be done:
- recover the original label even though residues did not match, using allow_nonmatch. See alignment_df2_conslist for more info
- reconstruct what the label could be using a heuristic to “autofill” the consensus labels, using autofill_consensus. See _fill_consensus_gaps for more info
Note
This method uses aligntop internally, see the doc on that method for more info.
- Parameters
top – Topology object
allow_nonmatch (bool, default is True) – Use consensus labels for non-matching positions in case the non-matches have equal lengths
autofill_consensus (bool, default is True) –
Even if there is a consensus mismatch with the sequence of the input AA2conlab_dict, try to relabel automagically, s.t.
- ['G.H5.25', 'G.H5.26', None, 'G.H5.28']
will be relabeled as
- ['G.H5.25', 'G.H5.26', 'G.H5.27', 'G.H5.28']
min_hit_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_hit_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment.
- Returns
map – List of len = top.n_residues with the consensus labels
- Return type
list
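The "autofill" idea behind autofill_consensus can be sketched with a simple gap-filling heuristic. This is not the library's _fill_consensus_gaps; it is a minimal stand-in that assumes labels end in an integer after the last dot and fills a run of None values only when the labels on both sides share a prefix and their numbers differ by exactly the length of the run plus one.

```python
def fill_gaps(labels):
    """Fill runs of None between two labels of the same fragment with consecutive numbers."""
    filled = list(labels)
    known = [ii for ii, label in enumerate(labels) if label is not None]
    for left, right in zip(known[:-1], known[1:]):
        if right - left == 1:
            continue  # no gap between these two labels
        prefix_l, _, num_l = labels[left].rpartition(".")
        prefix_r, _, num_r = labels[right].rpartition(".")
        same_fragment = prefix_l == prefix_r and num_l.isdigit() and num_r.isdigit()
        # Fill only if the numbering leaves room for exactly the missing positions
        if same_fragment and int(num_r) - int(num_l) == right - left:
            for offset in range(1, right - left):
                filled[left + offset] = f"{prefix_l}.{int(num_l) + offset}"
    return filled


print(fill_gaps(["G.H5.25", "G.H5.26", None, "G.H5.28"]))
```

Gaps at the start or end of the list, or between labels of different fragments, are left as None, since the heuristic has nothing to anchor them to.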
-