mdciao.cli.residue_neighborhoods

mdciao.cli.residue_neighborhoods(residues, trajectories, topology=None, res_idxs=False, ctc_cutoff_Ang=3.5, stride=1, ctc_control=5, n_nearest=4, scheme='closest-heavy', chunksize_in_frames=10000, nlist_cutoff_Ang=15, n_smooth_hw=0, sort=True, pbc=True, ylim_Ang=15, fragments=['lig_resSeq+'], fragment_names='', fragment_colors=None, graphic_ext='.pdf', table_ext='.dat', GPCR_uniprot=None, CGN_PDB=None, KLIFS_uniprotAC=None, output_dir='.', output_desc='neighborhood', t_unit='ns', curve_color='auto', background=True, graphic_dpi=150, short_AA_names=False, allow_same_fragment_ctcs=True, save_nomenclature_files=False, plot_timedep=True, n_cols=4, distro=False, n_jobs=1, separate_N_ctcs=False, accept_guess=False, switch_off_Ang=None, plot_atomtypes=False, no_disk=False, savefigs=True, savetabs=True, savetrajs=False, figures=True, pre_computed_distance_matrix=None, naive_bonds=False)

Per-residue neighborhoods based on contact frequencies between pairs of residues.

A neighborhood is a mdciao.contacts.ContactGroup-object containing a set of mdciao.contacts.ContactPair-objects with a shared residue, called the anchor_residue.

The contact frequencies will be printed, plotted and saved. The residue-residue distance time-traces used for their computation will be also returned

Note

The time-independent figures (e.g. “neighborhood.overall@3.5_Ang.pdf”) are always shown whereas the time-dependent figures (e.g. “neighborhood.GDP395.time_trace@3.5_Ang.pdf”) are never shown, because the number of time-traces becomes very high very quickly. It’s easier to look at them with an outside viewer.

The user may be prompted when necessary, although this behaviour can be turned off with accept_guess

Input can be from disk and/or from memory (see below).

Can be parallelized up to the number of used trajectories.

Many other optional parameters are exposed to allow fine-tuning of the computing, plotting, printing, and saving. Additional information can be regarding nomenclature, fragmentation heuristics and/or naming and or/coloring, residue labeling, time-trace averaging, data-streaming,

Parameters
  • residues (int, iterable of ints or str) –

    The residue(s) for which the neighborhood will be computed. This input is pretty flexible wrt to strings and numbers, which are interpreted as sequence indices unless res_idxs is True Valid inputs are are:

    • residues = [1,10,11,12]

    • residues = ‘1,10,11,12’

    • residues = ‘1,10-12’

    • residues = [1]

    • residues = 1

    • residues = ‘1’

    • residues = ‘1,10-12,GLU*,GDP*,E30’

    Please refer to mdciao.utils.residue_and_atom.rangeexpand_residues2residxs for more info

  • trajectories (str, mdtraj.Trajectory, or None) –

    The MD-trajectories to calculate the frequencies from. This input is pretty flexible. For more info check mdciao.utils.str_and_dict.get_sorted_trajectories. Accepted values are:

  • topology (str or Trajectory, default is None) – The topology associated with the trajectories If None, the topology of the first trajectory will be used, i.e. when no topology is passed, the first trajectory has to be either a .gro or .pdb file, or an Trajectory object

Other Parameters
  • res_idxs (bool, default is False) –

    Whether the indices of residues should be understood as
    • zero-indexed, residue serial indices or

    • residue sequence, eg. 30 in GLU30, this is called ‘resSeq’

    in an mdtraj.core.Residue-object

  • ctc_cutoff_Ang (float, default is 3.5) – Any residue-residue distance is considered a contact if d<=ctc_cutoff_Ang

  • stride (int, default is 1) – Stride the input data by this number of frames

  • ctc_control (int or float, default is 5) – Control the number of reported contacts. Can be an integer (keep the first n contacts) or a float representing a fraction [0,1] of the total number of contacts. Default is 5.

  • n_nearest (int, default is 4) – Exclude these many bonded neighbors for each residue

  • scheme (str, default is ‘closest-heavy’) – Type of scheme for computing distance between residues. Choices are {‘ca’, ‘closest’, ‘closest- heavy’, ‘sidechain’, ‘sidechain-heavy’}. See mdtraj.compute_distances documentation for more info

  • chunksize_in_frames (int, default is 10000) – Stream through the trajectory data in chunks of this many frames Can lead to memory errors if n_jobs makes it so that e.g. 4 trajectories of 10000 frames each are loaded to memory and their residue-residue distances computed

  • nlist_cutoff_Ang (int, default is 15) – Before computing the residue-residue distance for all frames, neighbor-list is created, for each residue, that includes the residues up to nlist_cutoff_Ang from the residue. Increase this parameters (e.g. to 30) if you expect large conformational changes and/or the geometry in topology. Setting this cutoff to None is equivalent to using no cutoff, i.e. all possible contacts are regarded

  • n_smooth_hw (int, default is 0) – Plots of the time-traces will be smoothed using a window of 2*n_smooth_hw

  • sort (bool, default is True) – Sort the input residues according to their indices

  • pbc (bool, default is True) – Use periodic boundary conditions

  • ylim_Ang (float, default is 15) – Limit in Angstrom of the y-axis of the time-traces. Default is 15. Switch to any other float or ‘auto’ for automatic scaling

  • fragments (list, default is [“lig_resSeq+”]) – Fragment control. For compatibility reasons, it has to be a list, even if it only has one element. There exist several input modes:

    • [“consensus”] : use things like “TM*” or “G.H*”, i.e.

    GPCR or CGN-sub-subunit labels.

    • List of len 1 with some fragmentation heuristic, e.g.

    [“lig_resSeq+”]. will use the default of mdciao.fragments.get_fragments. See there for info on defaults and other heuristics.

    • List of len N that can mix different possibilities: * iterable of integers (lists or np.arrays, e.g. np.arange(20,30) * ranges expressed as integer strings, “20-30” * ranges expressed as residue descriptors [“GLU30-LEU40”]

    Numeric expressions are interepreted as zero-indexed and unique residue serial indices, i.e. 30-40 does not necessarily equate “GLU30-LEU40” unless serial and sequence index coincide. If there’s more than one “GLU30”, the user gets asked to disambiguate. The resulting fragments need not cover all of the topology, they only need to not overlap.

  • fragment_names (string or list of strings, default is “”) – If string, it has to be a list of comma-separated values. If you want unnamed fragments, use None, “None”, or “”. Has to contain names for all fragments that result from fragments or more. mdciao wil try to use replace4latex to generate LaTeX expressions from stuff like “Galpha” You can use fragment_names=”None” or “” to avoid using fragment names

  • fragment_colors (None, boolean or list, default is None) – Assign colors to fragments. These colors will be used to color-code the frequency bars. If True, colors will be automatically selected, otherwise picked from the list. Use with cautions, plots get shrill quickly

  • graphic_ext (str, default is “.pdf”) – The extension (=format) of the saved figures

  • table_ext (str, default is “.dat”) – The extension (=format) of the saved tables

  • GPCR_uniprot (str or mdciao.nomenclature.LabelerGPCR, default is None) – For GPCR nomenclature. If str, e.g. “adrb2_human”. will try to locate a local filename or do a web lookup in the GPCRdb. If mdciao.nomenclature.LabelerGPCR, use this object directly (allows for object re-use when in API mode). See mdciao.nomenclature for more info and references. Please note the difference between UniProt Accession Code and UniProt entry name as explained here .

  • CGN_PDB (str or mdciao.nomenclature.LabelerCGN, default is None) – For CGN (G-alpha Numbering definitions) nomenclature. If str, e.g. “3SN6”, try to locate local filenames (“3SN6.pdb”, “CGN_3SN6.txt”) or do web lookups in https://www.mrc-lmb.cam.ac.uk/CGN/ and http://www.rcsb.org/. If mdciao.nomenclature.LabelerCGN, use this object directly (allows for object re-use when in API mode) See mdciao.nomenclature for more info and references.

  • KLIFS_uniprotAC (str or mdciao.nomenclature.LabelerKLIFS, default is None) – Uniprot Accession Code for kinase KLIFS nomenclature. If str, e.g. “P31751”, try to locate a local filename or do a web lookup in the GPCRdb. If mdciao.nomenclature.LabelerKLIFS, use this object directly (allows for object re-use when in API mode). See mdciao.nomenclature for more info and references. Please note the difference between UniProt Accession Code and UniProt entry name as explained here .

  • output_dir (str, default is ‘.’) – directory to which the results are written.

  • output_desc (str, default is ‘neighborhood’) – Descriptor for output files.

  • t_unit (str, default is ‘ns’) – Unit used for the temporal axis.

  • curve_color (str, default is ‘auto’) – Type of color used for the curves. Alternatives are “P” or “H”

  • background (bool, or color-like, (str, hex, rgb), default is True) – When smoothing, the original curve can appear in the background in different colors * True: use a fainted version of color * False: don’t plot any background * color-like: use this color for the background,

    can be: str, hex, rgba, anything matplotlib.pyplot.colors understands

  • graphic_dpi (int, default is 150) – Dots per Inch (DPI) of the graphic output. Only has an effect for bitmap outputs.

  • short_AA_names (bool, default is False) – Use one-letter aminoacid names when possible, e.g. K145 insted of Lys145.

  • allow_same_fragment_ctcs (bool, default is True) – Allow contacts whithin the same fragment.

  • save_nomenclature_files (bool, default is False) – Save available nomenclature definitions to disk so that they can be accessed locally in later uses.

  • plot_timedep (bool, default is True) – Plot and save time-traces of the contacts

  • n_cols (int, default is 4) – number of columns of the overall plot.

  • distro (bool, default is False) – Plot distance distributions instead of contact bar plots

  • n_jobs (int, default is 1) – Number of processors to use. The parallelization is done over trajectories and not over contacts, beyond n_jobs>n_trajs parallelization will not have any effect.

  • separate_N_ctcs (bool, default is False) – Separate the plot with the total number contacts from the time-trace plot.

  • accept_guess (bool, default is False) – Accept mdciao’s guesses regarding fragment identification using nomenclature labels

  • switch_off_Ang (NoneType, default is None) – Use a linear switchoff instead of a crisp one.

  • plot_atomtypes (bool, default is False) – Add the atom-types to the frequency bars by ‘hatching’ them. ‘–’ is sidechain-sidechain ‘|’ is backbone-backbone ‘’ is backbone-sidechain ‘/’ is sidechain-backbone. See Fig XX for an example

  • savefigs (bool, default is True) – Save the figures

  • savetabs (bool, default is True) – Save the frequency tables

  • savetrajs (bool, default is False) – Save the timetraces

  • no_disk (bool, default is False) – If True, don’t save any files at all: figs, tables, trajs, nomenclature

  • figures (bool, default is True) – Draw figures

  • pre_computed_distance_matrix ((m,m) np.ndarray, default is None) – The distance matrix here will speed up the pre-computing of likely neighbors. Usecase are several API-calls following each other

  • naive_bonds (bool, default is False) – If top doesn’t automatically yield a list bonds between residues, build naive (=linear) bonds using mdciao.utils.bonds.top2residue_bond_matrix_naive These bonds are needed to exclude bonded neighbors using n_nearest

Returns

out_dict

  • neighborhoods : dictionary keyed by unique, zero-indexed residue indices.

The values are mdciao.contacts.ContactGroup objects

  • ctc_idxs : 2D np.ndarray with the residue indices of the contact pairs within obj`:nlist_cutoff_Ang` in at least one frame

  • ctcs_trajs : list of per-traj 2D np.ndarrays with the mindist between the residues of “ctc_idxs”

  • time_array : list of per-traj time-arrays

Usually, only neighborhoods is usefull, other entries are there for debugging

Return type

dict