mdciao.cli.compare

mdciao.cli.compare(datasets, graphic_ext='.pdf', output_desc='freq_comparison', pop=False, **kwargs)

Compare contact frequencies across different sets of data

Parameters:
  • datasets (iterable (list or dict)) – The datasets to compare with each other. If dict, then the keys will be used as names for the contact groups, e.g. “WT”, “MUT” etc. If list, then the keys will be auto-generated. The entries of the list/dictionary can be:

    • ContactGroup objects. For these, a ctc_cutoff_Ang value needs to be passed along, otherwise frequencies cannot be computed on-the-fly.

    • dictionaries where the keys are residue-pairs, one letter-codes, no fragment info, as in mdciao.contacts.ContactGroup.ctc_labels_short and the values are contact frequencies

    • files generated by (or in the same format as) frequency_table

      • ascii-files with the contact frequencies in the first column and labels in the second and/or third column, see frequency_str_ASCII_file and freq_ascii2dict

      • .xlsx files with the header in the second row, containing at least the column-names “label” and “freqs”, see frequency_spreadsheet

  • graphic_ext (str, default is “.pdf”) – The extension for figures

  • output_desc (str, default is ‘freq_comparison’) – Descriptor for output files.

  • pop (bool, default is False) – Use show to force the figure to be drawn.

  • kwargs (dict) – Optional arguments for compare_groups_of_contacts, which are listed below:

Other Parameters:
  • colors (iterable (list or dict), or str, default is None) –

  • mutations_dict (dictionary, default is None) – A mutation dictionary that allows plotting together residues that would otherwise be identified as different contacts. If there were two mutations, e.g. A30K and D35A, the mutation dictionary will be {"A30":"K30", "D35":"A35"}. You can also use this parameter to correct indexing offsets, e.g. {"GDP395":"GDP", "GDP396":"GDP"}.

  • width (float, default is .2) – The width of the bars

  • ax (Axes or array thereof, default is None) – The default is to let the method draw its own figure and axis, but you can pass a pre-existing axis here. If distro is False, only one axis is needed, so you can pass the axis object directly here. If distro is True, a subplot is needed, where each panel contains the distributions of each contact. Hence, pass an array of axes if distro is True. See mdciao.plots.plot_unified_distro_dicts for more info (in particular ax_array).

  • figsize (tuple, default is (10,5)) – The figure size in inches, in case it is instantiated automatically by not passing an ax

  • fontsize (float, default is 16) – The fontsize to use

  • anchor (str, default is None) – This string will be deleted from the contact labels, leaving only the partner-residue to identify the contact. The deletion takes place after the mutations_dict has been applied. The final anchor label will be that of the deleted keys (allows for keeping e.g. pre-existing consensus nomenclature). No consistency-checks are carried out, i.e. use at your own risk

  • plot_singles (bool, default is False) – Produce one extra figure with as many subplots as systems in dictionary_of_groups, where each system is plotted separately. The labels used will have been already “mutated” using mutations_dict and “anchored” using anchor. This plot is temporary and cannot be saved.

  • ctc_cutoff_Ang (float) – Needed value to compute frequencies on-the-fly if the input was using ContactGroup objects

  • AA_format (str, default is “short”) – see frequency_dict for more info

  • defrag (str, default is “@”) – see unify_freq_dicts for more info

  • per_residue (bool, default is False) – Unify dictionaries by residue and not by pairs. If True, remove_identities is set to False automatically when calling plot_unified_freq_dicts

  • title (str, default is “comparison”) – The title for the plot

  • distro (bool, default is False) – Instead of plotting contact frequencies, plot contact distributions

  • interface (bool, default is False) – Sorts the residues into interface fragments. Will fail if the passed groups don’t have self.is_interface==True. It enforces a per-residue view, plotting a single bar per residue indicating in how many contacts that residue participates. See below ‘sort_by’ for how these residues get sorted within their respective interface fragments.

  • n_cols (int, default is 1) – Only has effect if distro is True. The number of columns in the multi-panel figure with the per-contact distributions.

  • sharex (bool, or string, default is False) – Only has effect if distro is True. Can be True or “col”, for sharing the x-axis across columns. See subplots for more info. Only has an effect if ax is None.

  • colordict (dict, default is None) – What color each system gets. Default is some sane matplotlib values

  • panelheight_inches (int, default is 5) – The height of the panel, in inches. Determines the figure size if figsize is None, else has no effect

  • inch_per_contacts (int, default is 1) – How many inches each contact-pair is given in the panel. Determines the figure size if figsize is None, else has no effect

  • sort_by (str or list of strings, default is “mean”) – If str, the property by which to sort the contacts. If list, the list of contact labels in the order in which they will be shown. If str, the possibilities are

    • “mean” sort (descending) by mean frequency over all systems, making most frequent contacts appear on the left/top of the plot.

    • “std” sort (descending) by per-contact standard deviation over all systems, making the contacts with most different values appear on top. This highlights more “deviant” contacts and might hence be more informative than “mean” in cases where a lot of contacts have similar frequencies (high or low). If this option is activated, a faint dotted line is incorporated into the plot that marks the std for each contact group

    • “keep” keep the contacts in whatever order they have in the first dictionary

    • “numeric” sort (ascending) the contacts by the first number that appears in the contact labels, e.g. “30” if the label is “GLU30@3.50-GDP”. You can use this to order by resSeq if the AA to sort by is the first one of the pair. Contact labels without numbers in them will be sorted alphabetically at the end of the labels with numbers.

    • “residue” alias for “numeric”

    • list of contact-labels : sort in the order established by this list. What will actually be plotted is the intersection of this list and the available contact labels of freqs after other parameters like lower_cutoff_val or identity_cutoff have taken effect, e.g. if a contact-label is discarded because of lower_cutoff_val, adding the label to this list won’t have any effect.

  • lower_cutoff_val (float, default is 0) – Hide contacts with small values. “values” changes meaning depending on sort_by. If sort_by is any of

    • “mean”, “keep”, “numeric”, “residue” or a list, then the contacts where all systems have frequencies lower than this value are hidden.

    • “std”, then the contacts where the standard deviation across systems itself is lower than this value are hidden. This hides contacts where all systems are similar, regardless of whether they’re all around 1, around .5 or around 0

  • remove_identities (bool, default is False) – If True, the contacts where freq[sys][ctc] >= identity_cutoff across all systems will neither be plotted nor considered in the sum over contacts. TODO: the word identity might be confusing

  • vertical_plot (bool, default is False) – Plot the bars vertically in descending sort_by instead of horizontally (better for a large number of frequencies)

  • identity_cutoff (float, default is 1) – If remove_identities is True, use this value to define what is considered an identity, so that contacts with values of e.g. .95 can also be removed. TODO: consider merging both identity parameters into one that is None or float

  • assign_w_color (boolean, default is False) – Color the text of the contact-labels according to the following criterion.

    • If all frequencies are below the lower_cutoff_val except for one system, then the label adopts the color of this system and gets prepended with a “+” sign.

    • If all frequencies are above the lower_cutoff_val except for one system, then the label adopts the color of this system and gets prepended with a “-” sign

    For more details see the paragraph “Visual Aides” of this notebook

  • legend_rows (int, default is 4) – The maximum number of rows per column of the legend. If you have 10 systems, legend_rows=5 means you’ll get two columns, legend_rows=2 means you’ll get five.

  • verbose_legend (bool, default is True) – Verbose legends inform about contacts that were in the input but have been left out of the plot. Contacts are left out if they are:

    • above the identity_cutoff or

    • below the lower_cutoff_val

    They will appear in the verbose legend as “+ A.a + B.b”, respectively denoting the missing contacts that are “a(bove)” and “b(elow)” with their respective sums “A” and “B”.

  • half_sigma (bool, default is False) – When True, instead of showing Sigma=20, Sigma = 2x10 will be shown. If a ContactGroup normally has a Sigma=10, that number doubles when showing per-residue values, because each contact is shown twice. Hence, showing half-sigma allows one to “keep” the number 10 in the legend, even though the shown Sigma is 20
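
The sort_by and lower_cutoff_val rules described above can be illustrated with a small, library-independent sketch. This is not mdciao's implementation: the helper names (unify, numeric_key, sort_and_filter) are hypothetical, missing contacts are assumed to count as frequency 0, and the population standard deviation is assumed for “std”.

```python
import re
import statistics


def unify(freqs_by_system):
    """Give every system an entry for every label (assumption: missing -> 0.0)."""
    labels = []
    for freqs in freqs_by_system.values():
        for label in freqs:
            if label not in labels:
                labels.append(label)  # order of first appearance, as in "keep"
    return {name: {lab: freqs.get(lab, 0.0) for lab in labels}
            for name, freqs in freqs_by_system.items()}


def numeric_key(label):
    """Sort by the first number in the label; number-free labels go last, alphabetically."""
    match = re.search(r"\d+", label)
    return (0, int(match.group())) if match else (1, label)


def sort_and_filter(freqs_by_system, sort_by="mean", lower_cutoff_val=0.0):
    """Return the contact labels that would be shown, in display order."""
    unified = unify(freqs_by_system)
    labels = list(next(iter(unified.values())))
    per_label = {lab: [unified[name][lab] for name in unified] for lab in labels}

    if sort_by == "std":
        # Here the cutoff applies to the standard deviation across systems.
        std = {lab: statistics.pstdev(vals) for lab, vals in per_label.items()}
        kept = [lab for lab in labels if std[lab] >= lower_cutoff_val]
        return sorted(kept, key=std.get, reverse=True)

    # All other modes: hide contacts where every system is below the cutoff.
    kept = [lab for lab in labels
            if any(f >= lower_cutoff_val for f in per_label[lab])]
    if sort_by == "mean":
        return sorted(kept, key=lambda lab: statistics.mean(per_label[lab]),
                      reverse=True)
    if sort_by in ("numeric", "residue"):
        return sorted(kept, key=numeric_key)
    if sort_by == "keep":
        return kept
    raise ValueError(f"unknown sort_by: {sort_by!r}")


# Made-up frequencies for two systems
example = {
    "WT": {"E30-K40": 1.0, "D35-R50": 0.9, "L394-GDP": 0.2},
    "MUT": {"E30-K40": 0.95, "D35-R50": 0.1, "L394-GDP": 0.05},
}
by_mean = sort_and_filter(example, sort_by="mean", lower_cutoff_val=0.25)
by_std = sort_and_filter(example, sort_by="std", lower_cutoff_val=0.05)
```

With these made-up values, “mean” hides L394-GDP (both systems are below 0.25), whereas “std” hides E30-K40 (both systems are nearly identical) and ranks D35-R50 first.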

Returns:

  • myfig (Figure) – Figure with the comparison plot

  • freqs (dictionary) – Unified frequency dictionaries, including mutations and anchor

  • plotted_freqs (dictionary) – Like freqs but sorted and purged according to the user-defined input options, s.t. it represents the plotted values
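
A minimal usage sketch, assuming mdciao is installed and that plain frequency dictionaries are passed as datasets. The system names, contact labels and frequency values are invented for illustration, and run_comparison is a hypothetical wrapper. Only keyword arguments documented above are used.

```python
# Frequency dictionaries keyed by short residue-pair labels (invented values)
freqs_WT = {"E30-K40": 1.0, "D35-R50": 0.9, "L394-R38": 0.2}
freqs_MUT = {"E30-K40": 0.95, "D35-R50": 0.1, "L394-R38": 0.05}

# With a dict, the keys become the names of the contact groups
datasets = {"WT": freqs_WT, "MUT": freqs_MUT}


def run_comparison(datasets):
    """Call mdciao.cli.compare and return its three outputs.

    mdciao is imported lazily so that this sketch can be loaded
    even where the package is not available.
    """
    import mdciao

    myfig, freqs, plotted_freqs = mdciao.cli.compare(
        datasets,
        graphic_ext=".png",        # extension for the figure
        output_desc="WT_vs_MUT",   # descriptor for output files
        title="WT vs MUT",
        sort_by="std",             # most different contacts first
        lower_cutoff_val=0.05,     # with "std": hide contacts with std below 0.05
    )
    return myfig, freqs, plotted_freqs
```

If the datasets were ContactGroup objects instead of dictionaries, a ctc_cutoff_Ang value would have to be passed as well, as noted in the description of datasets.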