mdciao.nomenclature.LabelerKLIFS
- class mdciao.nomenclature.LabelerKLIFS(KLIFS_string, local_path='.', format='KLIFS_%s.xlsx', verbose=True, try_web_lookup=True, write_to_disk=False, keep_PDB_geom=True)
Obtain and manipulate Kinase-Ligand Interaction notation of the 85 pocket-residues of kinases.
The residue notation is obtained from the Kinase–Ligand Interaction Fingerprints and Structure database, KLIFS.
Since the KLIFS database serves residue labels associated with specific PDBs, and there is more than one PDB per kinase, the lookup logic, implemented by the low-level method _KLIFS_web_lookup, allows for some flexibility in the input:
Query via a UniProt Accession Code. This yields a kinase ID (internal to KLIFS) with quality-scored PDBs associated to it; the labels are then taken from the highest-scored PDB (which has a unique structure ID, also internal to KLIFS). Please note that a UniProt Accession Code is not the same as a UniProt entry name.
Skip the first step and query directly via a KLIFS kinase ID. From that kinase ID, follow the same logic as above to get the associated highest-scored PDB (and its KLIFS structure ID) and then get the labels.
Skip the first and second steps and query directly via a KLIFS structure ID, and get the labels from the PDB associated to it, regardless of its score.
Please see the docstring on KLIFS_string on how to choose between the above strategies.
Once the PDB and structure ID have been determined, this method also
gets the 85 pocket residue indices (in that specific PDB file) and their consensus names.
gets a geometry containing the kinase and other chains associated to the kinase in the original PDB. According to the docs, that means “the full structure (including solvent, cofactors, ligands, etc.) in PDB format”.
All the above information is stored in this object and accessible via its attributes, check their individual documentation for more info.
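The three query strategies above differ only in how many lookup steps come before the labels are retrieved. A minimal sketch of that idea follows. It assumes a hypothetical helper, lookup_steps, which is not part of mdciao, and assumes the “key:value” string is split at the first colon; the actual logic lives in _KLIFS_web_lookup and may differ.

```python
# Hypothetical illustration, not mdciao code: which lookup steps a given
# "key:value" KLIFS_string would require, following the description above.
STEPS = ["UniProtAC -> kinase_ID",
         "kinase_ID -> highest-scored structure_ID",
         "structure_ID -> residue labels"]

# Index of the first step needed for each accepted key
FIRST_STEP = {"UniProtAC": 0, "kinase_ID": 1, "structure_ID": 2}


def lookup_steps(KLIFS_string):
    """Return (value, remaining steps) for a 'key:value' string."""
    key, sep, value = KLIFS_string.partition(":")
    if not sep or key not in FIRST_STEP:
        raise ValueError("expected 'key:value' with key in %s" % sorted(FIRST_STEP))
    return value, STEPS[FIRST_STEP[key]:]


for query in ["UniProtAC:P31751", "kinase_ID:2", "structure_ID:1904"]:
    value, steps = lookup_steps(query)
    print(query, "needs", len(steps), "step(s)")
```

A string without a recognized key, such as a plain filename, raises ValueError in this sketch, which loosely mirrors why such strings are handled by the local lookup instead.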
The local lookup logic, implemented by the low-level method _KLIFS_finder, is:
Use the KLIFS_string directly or in combination with format=“KLIFS_%s.xlsx” and local_path to locate a local Excel file. That Excel file has been generated previously by calling LabelerKLIFS with write_to_disk=True or by using the LabelerKLIFS.dataframe.to_excel method of an already instantiated LabelerKLIFS object. That Excel file will contain, apart from the nomenclature, all other attributes, including the PDB geometry, needed to re-generate the LabelerKLIFS locally. An example Excel file is distributed with mdciao and you can find it with:
>>> import mdciao
>>> mdciao.examples.filenames.KLIFS_P31751_xlsx
References
These are the most relevant references on the nomenclature itself, but please check how to cite KLIFS in case of doubt:
Van Linden, O. P. J., Kooistra, A. J., Leurs, R., De Esch, I. J. P., & De Graaf, C. (2014). KLIFS: A knowledge-based structural database to navigate kinase-ligand interaction space. Journal of Medicinal Chemistry, 57(2), 249–277. https://doi.org/10.1021/JM400378W
Kooistra, A. J., Kanev, G. K., Van Linden, O. P. J., Leurs, R., De Esch, I. J. P., & De Graaf, C. (2016). KLIFS: a structural kinase-ligand interaction database. Nucleic Acids Research, 44(D1), D365–D371. https://doi.org/10.1093/NAR/GKV1082
Kanev, G. K., de Graaf, C., Westerman, B. A., de Esch, I. J. P., & Kooistra, A. J. (2021). KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Research, 49(D1), D562–D569. https://doi.org/10.1093/NAR/GKAA895
- __init__(KLIFS_string, local_path='.', format='KLIFS_%s.xlsx', verbose=True, try_web_lookup=True, write_to_disk=False, keep_PDB_geom=True)
- Parameters:
KLIFS_string (str) – A string with a KLIFS identifier to be processed, or a filename for local lookup. If it is an identifier, it has to be formatted as “key:value”, which ultimately leads to a given KLIFS entry (see above). Acceptable keys and values for KLIFS_string are:
“UniProtAC”, e.g. “UniProtAC:P31751”
“kinase_ID”, e.g. “kinase_ID:2”
“structure_ID”, e.g. “structure_ID:1904”
Any of the above keys will yield the same labels, since the UniProtAC can be used to look up the kinase_ID, and the kinase_ID automatically picks the best structure_ID (PDB), but the user can choose to specify directly the kinase_ID or the structure_ID.
If a local file is to be used instead of online lookup, anything pointing to a valid file works, e.g. ‘KLIFS_P31751.xlsx’.
Finally, when the above fails, it will try to construct a filename by combining KLIFS_string with the format and local_path arguments.
local_path (str, default is “.”) – Since the KLIFS_string can be a filename (or turned into one, see above), this is the local path where to (potentially) look for files. Note that this optional parameter is here for compatibility reasons with other methods and might disappear in the future.
format (str, default is ‘KLIFS_%s.xlsx’) – A format string that turns the KLIFS_string directly into a filename for local lookup, in case the user has custom filenames, e.g. if the KLIFS_string=”P31751” then this format specifier will turn it into KLIFS_P31751.xlsx.
verbose (bool, default is True) – Be verbose. Gets passed to _KLIFS_finder
try_web_lookup (bool, default is True) – Try a web lookup on KLIFS via KLIFS_string. If KLIFS_string is e.g. “KLIFS_P31751.xlsx”, including the extension “xlsx”, then the lookup will fail. This is what the format parameter is for.
write_to_disk (bool, default is False) – Save an excel file with the nomenclature information.
keep_PDB_geom (bool, default is True) – If False, don’t store the PDB geom in the returned DataFrame when looking online or locally. For online lookups, the geom will still have been downloaded and used; it’s just not stored as extra sheets in the returned DataFrame, making the method faster and lighter when the PDB geoms are not really needed. For local lookups, if the local file was stored without the geom in the extra sheets and you are reading from the file with this parameter set to True, you’ll get an error.
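The way format and local_path turn a bare KLIFS_string into a filename can be sketched as below. The helper candidate_filenames is hypothetical (not part of mdciao), and the order of the candidates is an assumption based on the parameter descriptions above; _KLIFS_finder may resolve files differently.

```python
import os


def candidate_filenames(KLIFS_string, local_path=".", format="KLIFS_%s.xlsx"):
    """Hypothetical helper: filenames a local lookup could try, in order.

    The parameter names mirror the signature documented above.
    """
    return [
        # 1. the string itself, in case it already points to a file
        KLIFS_string,
        # 2. the string inserted into the format and placed in local_path
        os.path.join(local_path, format % KLIFS_string),
    ]


print(candidate_filenames("P31751"))
```

Passing a custom format, e.g. format="my_kinase_%s.xlsx", would only change the second candidate.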
Methods
__init__(KLIFS_string[, local_path, format, ...]) – Takes KLIFS_string (str), a string with a KLIFS identifier to be processed, or a filename for local lookup.
aligntop(top[, restrict_to_residxs, ...]) – Align a topology with the object's sequence.
conlab2residx(top[, map]) – Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.
top2frags(top[, fragments, min_seqID_rate, ...]) – Return the subdomains derived from the consensus nomenclature and map it out in terms of residue indices of the input top
top2labels(top[, allow_nonmatch, ...]) – Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.
Attributes
AA2conlab – Dictionary with short AA-codes as keys and consensus labels as values
conlab2AA – Dictionary with consensus labels as keys and short AA-codes as values
conlab2idx – Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe as values
dataframe – DataFrame summarizing this object's information
fragment_names – Name of the fragments according to the consensus labels
fragments – Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.
fragments_as_conlabs – Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment
fragments_as_idxs – Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not
fragments_as_resSeqs – Dictionary of fragments keyed with fragment names and valued with the residue sequence indices (resSeq) in that fragment
idx2conlab – List of consensus labels in the order (=idx) they appear in the original dataframe
most_recent_alignment – A DataFrame with the most recent alignment
most_recent_top2labels – The most recent self.top2labels result
seq – The reference sequence in dataframe
tablefile – The file used to instantiate this transformer
- property AA2conlab
Dictionary with short AA-codes as keys, so that e.g.
self.AA2conlab[“R131”] -> ‘3.50’
self.AA2conlab[“R201”] -> “G.hfs2.2”
- aligntop(top, restrict_to_residxs=None, min_seqID_rate=0.5, fragments='resSeq', verbose=False)
Align a topology with the object’s sequence. Returns two maps (top2self, self2top) and populates the attribute self.most_recent_alignment.
Wraps around mdciao.utils.sequence.align_tops_or_seqs.
The indices of self are indices (row-indices) of the original dataframe, which are the ones in seq.
- Parameters:
top (Topology object or string)
restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residue indices defining the topology segment within which the match should be contained.
min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment. If min_seqID_rate > 0, restrict_to_residxs has to be None.
fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:
XXL XX      XX LXX
||| ||  vs  || |||
XXLLXX      XXLLXX
In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.
Note
fragments only has an effect if both
the top is an actual Topology carrying the sequence indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.
two or more alignments share the optimal alignment score
The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation for other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use fragments=False. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign, regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.
verbose (boolean, default is False) – Be verbose
- Returns:
top2self (dict) – Maps indices of top to indices of this object’s self.seq
self2top (dict) – Maps indices of this object’s self.seq to indices of top
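The tie-breaking role of fragments and the shape of the two returned maps can be illustrated with the ‘XXLXX’ vs ‘XXLLXX’ example. This is a simplified sketch with hypothetical helpers (gap_splits_fragment, index_maps), not the code behind aligntop: it assumes alignments are given as gapped strings and that an alignment “respects” a fragmentation when no gap falls inside a fragment.

```python
def gap_splits_fragment(gapped_top, fragments):
    """True if a gap in the aligned input sequence falls inside a fragment."""
    frag_of = {ridx: ii for ii, frag in enumerate(fragments) for ridx in frag}
    ridx, after_gap = -1, False
    for char in gapped_top:
        if char == "-":
            after_gap = True
            continue
        ridx += 1
        # A gap between residues ridx-1 and ridx of the same fragment breaks it
        if after_gap and ridx > 0 and frag_of[ridx - 1] == frag_of[ridx]:
            return True
        after_gap = False
    return False


def index_maps(gapped_top, gapped_self):
    """Build top2self and self2top from matching, non-gap alignment columns."""
    top2self, itop, iself = {}, 0, 0
    for ctop, cself in zip(gapped_top, gapped_self):
        if ctop != "-" and cself != "-" and ctop == cself:
            top2self[itop] = iself
        itop += ctop != "-"    # advance only on residues, not on gaps
        iself += cself != "-"
    return top2self, {val: key for key, val in top2self.items()}


reference = "XXLLXX"
candidates = ["XXL-XX", "XX-LXX"]  # two equally scored alignments of 'XXLXX'
fragments = [[0, 1], [2, 3, 4]]    # the input fragmented as [XX], [LXX]

chosen = next(aln for aln in candidates if not gap_splits_fragment(aln, fragments))
top2self, self2top = index_maps(chosen, reference)
print(chosen, top2self, self2top)
```

With this fragmentation the second candidate is picked, because its gap sits exactly at the boundary between the two fragments.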
- property conlab2AA
Dictionary with consensus labels as keys, so that e.g.
self.conlab2AA[“3.50”] -> ‘R131’ or
self.conlab2AA[“G.hfs2.2”] -> ‘R201’
- property conlab2idx
Dictionary with consensus labels as keys and zero-indexed row-indices of self.dataframe as values, so that e.g. self.conlab2idx[“3.50”] returns the row-index of the residue labelled ‘3.50’ (‘R131’ in the conlab2AA example), and likewise for self.conlab2idx[“G.hfs2.2”].
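How AA2conlab, conlab2AA, conlab2idx and idx2conlab relate to one another can be sketched with a toy table. The rows below are a made-up stand-in for self.dataframe that reuses the labels from the examples above, and leaving unlabelled residues out of the dictionaries is an assumption of this sketch.

```python
# Toy stand-in for the rows of self.dataframe: (AA code, consensus label or None)
rows = [("R131", "3.50"), ("A132", None), ("R201", "G.hfs2.2")]

# One entry per row, in row order; unlabelled rows carry None
idx2conlab = [label for _, label in rows]

# Lookups in both directions, skipping rows without a consensus label
AA2conlab = {AA: label for AA, label in rows if label is not None}
conlab2AA = {label: AA for AA, label in rows if label is not None}

# Consensus label -> zero-indexed row of the table
conlab2idx = {label: idx for idx, label in enumerate(idx2conlab) if label is not None}

print(AA2conlab["R131"], conlab2AA["G.hfs2.2"], conlab2idx["G.hfs2.2"])
```

The row indices in conlab2idx refer to the table itself, not to residue indices of any topology.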
- conlab2residx(top, map=None, **top2labels_kwargs)
Returns a dictionary keyed by consensus labels and valued by residue indices of the input topology in top.
The default behaviour is to internally align top with the object’s available consensus dictionary on the fly using top2labels. See the docs there for **top2labels_kwargs, in particular restrict_to_residxs, keep_consensus, and min_seqID_rate.
Note
This method is able to work with a new topology every time, performing a sequence alignment on every call. The intention is to instantiate a LabelerConsensus just one time and use it with as many topologies as you like without changing any attribute of self.
HOWEVER, if you know what you are doing, you can provide a list of consensus labels yourself using map. Then, this method is nothing but a table lookup (almost).
Warning
No checks are performed to see if the input of map actually matches the residues of top in any way, so that the output can be rubbish and go unnoticed.
- Parameters:
top (Topology)
map (list, default is None) – A pre-computed residx2consensuslabel map, i.e. the output of a previous, external call to _top2consensus_map. If it contains duplicates, it is a malformed list. See the note above for more info.
top2labels_kwargs (dict) – Optional parameters for top2labels, which are listed below
- Other Parameters:
allow_nonmatch (bool, default is True) – Use consensus labels for non-matching positions in case the non-matches have equal lengths
autofill_consensus (boolean, default is True) – Even if there is a consensus mismatch with the sequence of the input AA2conlab_dict, try to relabel automagically, s.t.
[‘G.H5.25’, ‘G.H5.26’, None, ‘G.H5.28’]
will be relabeled as
[‘G.H5.25’, ‘G.H5.26’, ‘G.H5.27’, ‘G.H5.28’]
min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment.
restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residue indices defining the topology segment within which the match should be contained.
fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:
XXL XX      XX LXX
||| ||  vs  || |||
XXLLXX      XXLLXX
In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.
Note
fragments only has an effect if both
the top is an actual Topology carrying the sequence indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.
two or more alignments share the optimal alignment score
The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation for other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use fragments=False. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign, regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.
verbose (boolean, default is False) – Be verbose
- Returns:
Dictionary keyed by consensus labels and valued with residue idxs
- Return type:
dict
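When a pre-computed map is passed, the method is described as little more than a table lookup. A minimal sketch of such a lookup follows, assuming a hypothetical helper conlab2residx_from_map (not part of mdciao) that takes a per-residue list of labels, with None for unlabelled residues, and rejects duplicated labels as malformed.

```python
from collections import Counter


def conlab2residx_from_map(consensus_map):
    """Invert a per-residue list of consensus labels into {label: residue index}."""
    labels = [lab for lab in consensus_map if lab is not None]
    duplicates = sorted(lab for lab, count in Counter(labels).items() if count > 1)
    if duplicates:
        raise ValueError("malformed map, duplicated labels: %s" % duplicates)
    return {lab: idx for idx, lab in enumerate(consensus_map) if lab is not None}


# One entry per residue of a toy topology with five residues
consensus_map = [None, "3.49", "3.50", None, "3.51"]
print(conlab2residx_from_map(consensus_map))
```

As the warning above points out, nothing in such a lookup verifies that the list actually belongs to the topology at hand.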
- property dataframe: DataFrame
DataFrame summarizing this object’s information
- Returns:
df
- Return type:
DataFrame
- property fragment_names
Name of the fragments according to the consensus labels
TODO OR NOT? Check!
- property fragments
Dictionary of fragments keyed with fragment names and valued with the residue names (AAresSeq) in that fragment.
- property fragments_as_conlabs
Dictionary of fragments keyed with fragment names and valued with the consensus labels in that fragment
- property fragments_as_idxs
Dictionary of fragments keyed with fragment names and valued with idxs of the first column of self.dataframe, regardless of these residues having a consensus label or not
- Returns:
fragments_as_idxs
- Return type:
dict
- property fragments_as_resSeqs: dict
Dictionary of fragments keyed with fragment names and valued with the residue sequence indices (resSeq) in that fragment
- Returns:
fragments_as_resSeqs
- Return type:
dict
- property idx2conlab
List of consensus labels in the order (=idx) they appear in the original dataframe
This index is the row-index of the table, don’t count on it being aligned with anything
- property most_recent_alignment: DataFrame
A DataFrame with the most recent alignment
Expert use only
- Returns:
df
- Return type:
DataFrame
- property most_recent_top2labels
The most recent self.top2labels result
Expert use only
- Returns:
df
- Return type:
list
- property tablefile
The file used to instantiate this transformer
- top2frags(top, fragments=None, min_seqID_rate=0.5, input_dataframe=None, show_alignment=False, atoms=False, verbose=True) -> dict
Return the subdomains derived from the consensus nomenclature and map it out in terms of residue indices of the input top
Note
This method uses aligntop internally, see the doc on that method for more info.
- Parameters:
top – Topology or path to topology file (e.g. a pdb)
fragments (iterable of integers, default is None) – Any useful fragment definition as lists of residue indices. Useful means:
Help with the alignment needed for consensus fragment definition. Look at LabelerConsensus.aligntop and its fragments and min_seqID_rate parameters.
Check if the newly found, consensus fragment definitions (defs) clash with the input in fragments. Clash means that the defs would span over more than one of the fragments defined in fragments.
An interactive prompt will ask the user which fragments to keep in case of clashes.
Check the method check_if_fragment_clashes for more info.
min_seqID_rate (float, default is .5) – With big topologies, like a receptor-Gprotein system, the “brute-force” alignment method (check mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results, e.g. finding short snippets of reference sequence that align in a completely wrong part of the topology. To avoid this, an initial, exploratory alignment is carried out. min_seqID_rate = .5 means that only the fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in this exploration are used to improve the second alignment.
input_dataframe (DataFrame, default is None) – Expert option, use at your own risk. Instead of aligning top to the object’s sequence to derive fragment definitions, input an existing alignment here, e.g. the self.most_recent_alignment.
show_alignment (bool, default is False) – Show the entire alignment as DataFrame
atoms (bool, default is False) – Instead of returning residue indices, return atom indices
verbose (bool, default is True) – Also print the definitions
- Returns:
defs – Dictionary with subdomain names as keys and arrays of indices (residue or atom) as values
- Return type:
dictionary
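The clash condition described for the fragments parameter, a consensus definition spanning more than one input fragment, can be sketched as follows. The helper clashing_definitions and the subdomain names are hypothetical; the real check is done by check_if_fragment_clashes, which additionally prompts the user.

```python
def clashing_definitions(defs, fragments):
    """Return {name: indices of the input fragments it spans} for clashing defs."""
    frag_of = {ridx: ii for ii, frag in enumerate(fragments) for ridx in frag}
    clashes = {}
    for name, residxs in defs.items():
        # Which input fragments do the residues of this definition belong to?
        spanned = sorted({frag_of[ridx] for ridx in residxs if ridx in frag_of})
        if len(spanned) > 1:
            clashes[name] = spanned
    return clashes


# Two input fragments and two made-up consensus definitions
fragments = [[0, 1, 2, 3], [4, 5, 6, 7]]
defs = {"subdomain_A": [0, 1, 2], "subdomain_B": [3, 4, 5]}

print(clashing_definitions(defs, fragments))
```

Here only subdomain_B clashes, since residue 3 belongs to the first input fragment and residues 4 and 5 to the second.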
- top2labels(top, allow_nonmatch=True, autofill_consensus=True, min_seqID_rate=0.5, **aligntop_kwargs) -> list
Align the sequence of top to the sequence used to initialize this LabelerConsensus and return a list of consensus labels for each residue in top.
Populates the attributes most_recent_top2labels and most_recent_alignment.
If a consensus label is returned as None it means one of two things:
this position was successfully aligned with a match but the data used to initialize this LabelerConsensus did not contain a label
this position has a label in the original data but the sequence alignment is not matched (e.g., because of a point mutation)
To remedy the second case a posteriori, two things can be done:
recover the original label even though residues did not match, using allow_nonmatch. See alignment_df2_conslist for more info
reconstruct what the label could be using a heuristic to “autofill” the consensus labels, using autofill_consensus. See _fill_consensus_gaps for more info.
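The “autofill” idea can be illustrated with a deliberately simple heuristic. The sketch below assumes labels of the form prefix.number and fills only single gaps flanked by labels with the same prefix and numbers two apart; fill_label_gaps is a hypothetical helper, and _fill_consensus_gaps may apply different rules.

```python
def fill_label_gaps(labels):
    """Fill single None entries flanked by 'prefix.n' and 'prefix.n+2'."""
    filled = list(labels)
    for ii in range(1, len(filled) - 1):
        before, after = filled[ii - 1], filled[ii + 1]
        if filled[ii] is not None or before is None or after is None:
            continue
        prefix_b, _, num_b = before.rpartition(".")
        prefix_a, _, num_a = after.rpartition(".")
        same_prefix = prefix_b == prefix_a
        if same_prefix and num_b.isdigit() and num_a.isdigit():
            if int(num_a) - int(num_b) == 2:
                filled[ii] = "%s.%d" % (prefix_b, int(num_b) + 1)
    return filled


print(fill_label_gaps(["G.H5.25", "G.H5.26", None, "G.H5.28"]))
```

Gaps between different subdomains, or gaps longer than one residue, are left as None in this sketch.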
Note
This method uses aligntop internally, see the doc on that method for more info.
- Parameters:
top (Topology object or str) – The topology as an object or a path to a filename, e.g. a pdb file.
allow_nonmatch (bool, default is True) – Use consensus labels for non-matching positions in case the non-matches have equal lengths
autofill_consensus (boolean, default is True) – Even if there is a consensus mismatch with the sequence of the input AA2conlab_dict, try to relabel automagically, s.t.
[‘G.H5.25’, ‘G.H5.26’, None, ‘G.H5.28’]
will be relabeled as
[‘G.H5.25’, ‘G.H5.26’, ‘G.H5.27’, ‘G.H5.28’]
min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment.
aligntop_kwargs (dict) – Optional parameters for aligntop, which are listed below
- Other Parameters:
restrict_to_residxs (iterable of integers, default is None) – Use only these residues for alignment and labelling purposes. Helps “guide” the alignment method. E.g., for big topologies the alignment might find some small matches somewhere and, in some corner cases, match those instead of the desired ones. Here, one can pass residue indices defining the topology segment within which the match should be contained.
min_seqID_rate (float, default is .5) – With big topologies and many fragments, the alignment method (mdciao.utils.sequence.my_bioalign) sometimes yields sub-optimal results. A value min_seqID_rate > 0, e.g. .5, means that a pre-alignment takes place to populate restrict_to_residxs with the indices of those fragments (mdciao.fragments.get_fragments defaults) with more than 50% alignment in the pre-alignment. If min_seqID_rate > 0, restrict_to_residxs has to be None.
fragments (str, iterable, None, or bool, default is ‘resSeq’) – Fragment definitions to resolve situations where two (or more) alignments share the optimal alignment score. Consider aligning an input sequence ‘XXLXX’ to the object’s sequence ‘XXLLXX’. There are two equally scored alignments:
XXL XX      XX LXX
||| ||  vs  || |||
XXLLXX      XXLLXX
In order to choose between these two alignments, it’s checked which alignment observes the fragment definition passed here. This definition can be passed explicitly as iterable of integers or implicitly as a fragmentation heuristic, which will be used by mdciao.fragments.get_fragments on the top. So, if e.g. the input ‘XXLXX’ sequence is fragmented (explicitly or implicitly) into [XX],[LXX], then the second alignment will be chosen, given that it respects that fragmentation.
Note
fragments only has an effect if both
the top is an actual Topology carrying the sequence indices, since if top is a sequence string, then there’s no fragmentation heuristic possible.
two or more alignments share the optimal alignment score
The method avoids breaking the consensus definitions across the input fragments, while also providing consensus definitions for those other residues not present in fragments. This is done by using ‘resSeq’ to infer the missing fragmentation. This keeps the functionality of respecting the original fragments while also providing consensus fragmentation for other parts of the topology. For compatibility with other methods, passing fragments=None will still use the fragmentation heuristic (this might change in the future). To explicitly circumvent this forced fragmentation and subsequent check, use fragments=False. This will simply use the first alignment that comes out of mdciao.utils.sequence.my_bioalign, regardless of there being other, equally scored, alignments and potential clashes with sensitive fragmentations.
verbose (boolean, default is False) – Be verbose
- Returns:
map – List of len = top.n_residues with the consensus labels
- Return type:
list
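The effect of min_seqID_rate can be pictured as a filter on fragments after a pre-alignment. The sketch below uses a hypothetical helper, residxs_from_prealignment, and takes the outcome of the pre-alignment as given (a list of residue indices that matched); how mdciao computes and scores that pre-alignment is not shown here.

```python
def residxs_from_prealignment(fragments, matched_residxs, min_seqID_rate=0.5):
    """Collect residues of fragments whose matched fraction exceeds the rate."""
    matched = set(matched_residxs)
    keep = []
    for frag in fragments:
        rate = sum(ridx in matched for ridx in frag) / len(frag)
        # "more than 50% alignment" is read here as a strict inequality
        if rate > min_seqID_rate:
            keep.extend(frag)
    return keep


fragments = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
matched = [0, 1, 2, 5, 9]  # residues that matched in the toy pre-alignment

print(residxs_from_prealignment(fragments, matched))
```

The resulting indices play the role of restrict_to_residxs for the second, definitive alignment.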