mdciao.utils.str_and_dict

Functions for manipulating strings and dictionaries, also a bit of IO.

Functions

average_freq_dict(freqs[, weights])

Average frequencies (or anything) over dictionaries.

choose_options_descencing(options[, fmt, …])

Return the first entry that’s acceptable according to some rule

defrag_key(key[, defrag, sep])

Remove fragment information from a contact label

delete_exp_in_keys(idict, exp[, sep])

Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. “GLU30-ARG40”, deletes the segment containing the input expression, exp

delete_pattern_in_ctc_label(pattern, label, sep)

df_str_formatters(df)

Return formatters for pandas.DataFrame.to_string

fnmatch_ex(patterns_as_csv, list_of_keys)

Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html)

freq_ascii2dict(ifile[, comment])

Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column).

freq_file2dict(ifile[, defrag])

Read a file containing the frequencies (“freq”) and labels (“label”) of pre-computed contacts

get_sorted_trajectories(trajectories)

Common parser for something that can be interpreted as a trajectory

inform_about_trajectories(trajectories[, …])

Return a string that informs about the trajectories

intblocks_in_str(istr)

Return the integers that appear as contiguous blocks in strings

iterate_and_inform_lambdas(ixtc, chunksize)

Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report

latex_mathmode(istr[, enclose])

Prepend symbol words with “\” and protect non-symbol words with ‘\mathrm{}’

latex_superscript_fragments(contact_label[, …])

Format fragment descriptors as Latex math-mode superscripts

lexsort_ctc_labels(ctc_labels[, reverse, …])

Sort contact-labels in ascending order of resSeq using both columns

match_dict_by_patterns(patterns_as_csv, …)

Joins all the values in an input dictionary if their key matches some patterns.

order_key(key, sep)

replace4latex(istr[, sindex, symbols, …])

Return a string where symbols and super/sub-indices have been prepared for LaTeX

replace_w_dict(input_str, exp_rep_dict)

Sequentially perform string replacements on a string using a dictionary

sort_dict_by_asc_values(idict[, reverse])

Sort a dictionary by values

splitlabel(label[, sep, defrag])

Split a contact label.

sum_dict_per_residue(idict, sep)

Return a “per-residue” sum of values from a “per-residue-pair” keyed dictionary

unify_freq_dicts(freqs[, exclude, …])

Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.

Classes

FilenameGenerator(output_desc, …)

Generate per project filenames when you need them

class mdciao.utils.str_and_dict.FilenameGenerator(output_desc, ctc_cutoff_Ang, output_dir, graphic_ext, table_ext, graphic_dpi, t_unit)

Generate per project filenames when you need them

This is a WIP to consolidate all filenaming in one place, s.t. all sanitizing and project-specific naming operations happen here and not in the cli methods

A named tuple would’ve been enough, but we need some methods for dynamic naming (e.g. per-residue or per-traj)

mdciao.utils.str_and_dict.average_freq_dict(freqs, weights=None, **unify_freq_dicts_kwargs)

Average frequencies (or anything) over dictionaries.

Typically, the input freqs are keyed first by system, then by contact label, e.g. {“T300”:{“GDP-R201”:1.0}, “T320”:{“GDP-R201”:.25}, “MUT”:{“GDP-L201”:25}}

The input data need not be unified; the method calls unify_freq_dicts internally. In the example above you have to call it with the arg replacement_dict={“L201”:“R201”} so that it can understand that mutation when unifying

Parameters
  • freqs (dict of dicts) – The dictionaries containing the frequency dictionaries

  • weights (dict, default is None) – relative weights of each dictionary

  • unify_freq_dicts_kwargs

Returns

averaged_dict – an averaged dictionary keyed only with the

Return type

dict
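As a rough illustration of the averaging described above, the following plain-Python sketch averages dictionaries that already share their keys. It does not call mdciao: the helper name `average_unified`, the equal default weights, and the normalization of the weights by their sum are assumptions made for this example.

```python
def average_unified(freqs, weights=None):
    """Average dicts that already share the same keys (hypothetical helper)."""
    systems = list(freqs)
    if weights is None:
        # Assumption: every system counts equally when no weights are given
        weights = {system: 1.0 for system in systems}
    total = sum(weights[system] for system in systems)
    labels = freqs[systems[0]].keys()
    # Weighted mean per contact label, weights normalized by their sum
    return {label: sum(freqs[s][label] * weights[s] for s in systems) / total
            for label in labels}


freqs = {"T300": {"GDP-R201": 1.0}, "T320": {"GDP-R201": 0.25}}
averaged = average_unified(freqs)
weighted = average_unified(freqs, weights={"T300": 3, "T320": 1})
print(averaged, weighted)
```

With non-unified input (different keys per system), a unification step such as the one documented for unify_freq_dicts would have to run first.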

mdciao.utils.str_and_dict.choose_options_descencing(options, fmt='%s', dont_accept=['none', 'na'])

Return the first entry that’s acceptable according to some rule

If no acceptable entry is found, “” is returned.

Parameters
  • options (list) –

  • fmt (str, default is "%s") – You can specify a different format here. Will only apply in case something is returned

  • dont_accept (list) – Move down the list if current item is one of these

Returns

best – Either the best entry in options or “” if no option was found

Return type

str
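The selection rule can be sketched in a few lines of plain Python. This is not the library's code: the helper name `first_acceptable` is hypothetical, and the case-insensitive comparison (which also rejects a literal None, since str(None).lower() is "none") is an assumption of this sketch.

```python
def first_acceptable(options, fmt="%s", dont_accept=("none", "na")):
    """Return the first acceptable option, formatted, or "" (hypothetical helper)."""
    for option in options:
        # Assumption: compare the lower-cased string form against dont_accept
        if str(option).lower() not in dont_accept:
            return fmt % option
    return ""


best = first_acceptable([None, "NA", "3.50", "frag1"], fmt="@%s")
nothing = first_acceptable(["none", "na"])
print(repr(best), repr(nothing))
```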

mdciao.utils.str_and_dict.defrag_key(key, defrag='@', sep='-')

Remove fragment information from a contact label

Parameters
  • key (str) – Contact label with some sort of pair information, e.g. R1@frag1-E2@frag2 -> R1-E2

  • defrag (char, default is "@") – Character that indicates the beginning of the fragment

  • sep (char, default is "-") – Character that indicates the separation between first and second residue of the pair
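A naive way to picture the effect of the three parameters is shown below. The helper `defrag_simple` is hypothetical and is not mdciao's implementation; in particular, it assumes that fragment names do not themselves contain the separator.

```python
def defrag_simple(key, defrag="@", sep="-"):
    """Drop everything after `defrag` in each residue of a label (hypothetical helper)."""
    # Assumption: `sep` only appears between residues, never inside a fragment name
    residues = key.split(sep)
    return sep.join(residue.split(defrag)[0] for residue in residues)


plain = defrag_simple("R1@frag1-E2@frag2")
untouched = defrag_simple("R1-E2")
print(plain, untouched)
```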

mdciao.utils.str_and_dict.delete_exp_in_keys(idict, exp, sep='-')

Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. “GLU30-ARG40”, deletes the segment containing the input expression, exp

Will fail if not all keys have the expression to be deleted

Parameters
  • idict (dictionary) –

  • exp (str) –

  • sep (str, default is "-") –

Returns

  • dict – dictionary with the same values but the keys lack the segment containing exp

  • dhk (list) – List with the deleted half-keys
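The behaviour described above can be approximated with the sketch below. The helper name `delete_segment` is hypothetical, and raising a ValueError when a key lacks the expression is an assumption about how the failure shows up; the real function may fail differently.

```python
def delete_segment(idict, exp, sep="-"):
    """Remove the key segment containing `exp` from every key (hypothetical helper)."""
    new_dict, deleted_half_keys = {}, []
    for key, value in idict.items():
        segments = key.split(sep)
        hits = [segment for segment in segments if exp in segment]
        if len(hits) != 1:
            # Assumption: exactly one of the segments must contain exp
            raise ValueError("Cannot delete %r from key %r" % (exp, key))
        kept = [segment for segment in segments if exp not in segment]
        new_dict[sep.join(kept)] = value
        deleted_half_keys.append(hits[0])
    return new_dict, deleted_half_keys


shortened, dhk = delete_segment({"GLU30-ARG40": 0.5, "ARG40-LYS50": 0.2}, "ARG40")
print(shortened, dhk)
```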

mdciao.utils.str_and_dict.df_str_formatters(df)

Return formatters for pandas.DataFrame.to_string

In principle, this should be solved by https://github.com/pandas-dev/pandas/issues/13032, but I cannot get it to work

Parameters

df (DataFrame) –

Returns

formatters – Keyed with df-keys and valued with lambdas s.t. formatters[key](istr)=formatted_istr

Return type

dict

mdciao.utils.str_and_dict.fnmatch_ex(patterns_as_csv, list_of_keys)

Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html)

This method also allows for exclusions (similar to grep -v)

TODO: find out if regular expression re.findall() is better

Uses fnmatch under the hood

Parameters
  • patterns_as_csv (str) – Patterns to include or exclude, separated by commas, e.g.

    • “H*,-H8” will include all TMs but not H8

    • “G.S*” will include all beta-sheets

  • list_of_keys (list) – Keys against which to match the patterns, e.g. [“H1”, “ICL1”, “H2”, …, “ICL3”, “H6”, “H7”, “H8”]

Returns

matching_keys

Return type

list
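The include/exclude logic can be illustrated with the standard-library fnmatch module. The sketch below is not the library's implementation: the helper name `match_with_exclusions` is hypothetical, and it assumes that a leading "-" marks a pattern as an exclusion, as in the “H*,-H8” example.

```python
from fnmatch import fnmatchcase


def match_with_exclusions(patterns_as_csv, list_of_keys):
    """Keep keys matching any include-pattern and no exclude-pattern (hypothetical helper)."""
    patterns = [pattern.strip() for pattern in patterns_as_csv.split(",")]
    # Assumption: a leading "-" turns a pattern into an exclusion
    include = [p for p in patterns if not p.startswith("-")]
    exclude = [p[1:] for p in patterns if p.startswith("-")]
    return [key for key in list_of_keys
            if any(fnmatchcase(key, p) for p in include)
            and not any(fnmatchcase(key, p) for p in exclude)]


keys = ["H1", "ICL1", "H2", "ICL3", "H6", "H7", "H8"]
helices = match_with_exclusions("H*,-H8", keys)
print(helices)
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform.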

mdciao.utils.str_and_dict.freq_ascii2dict(ifile, comment=['#'])

Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column). Columns are separated by tabs or spaces.

Contact labels have to come after the frequency in the form of “res1 res2”, “res1-res2” or “res1 - res2”.

Columns other than the frequencies and the residue labels are ignored.

Examples

File produced by mdciao:

#freq              label              residue idxs  sum
0.59 R389@G.H5.21    - L394@G.H5.26    348 353    0.59
0.46 L394@G.H5.26    - K270@6.32x32    353 972    1.05
0.34 L388@G.H5.20    - L394@G.H5.26    347 353    1.39
0.32 L394@G.H5.26    - L230@5.69x69    353 957    1.71
0.04 R385@G.H5.17    - L394@G.H5.26    344 353    1.75

Minimal file with mixed labeling:

1 ALA30-GLU50
.5 ASP31 - GLU51
.1 ASP31 GLU50

TODO use pandas to allow more flex, not needed for the moment

Parameters
  • ifile (str) – The filename to be read

  • comment (list of chars) – Any line starting with any of these characters will be ignored

Returns

freqdict – Keys are “res1-res2” (regardless of input) and values are freqs

Return type

dictionary
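To make the accepted label forms concrete, here is a small parsing sketch that works on a list of lines instead of a file. It is only an approximation of the behaviour described above: the helper name `parse_freq_lines` and the regular expression used to collapse “res1 - res2” are assumptions, not mdciao code.

```python
import re


def parse_freq_lines(lines, comment=("#",)):
    """Parse 'freq label' lines into {"res1-res2": freq} (hypothetical helper)."""
    freqdict = {}
    for line in lines:
        line = line.strip()
        if not line or line[0] in comment:
            continue  # skip blank lines and comment lines
        # Collapse "res1   - res2" into "res1-res2" before splitting on whitespace
        fields = re.sub(r"\s+-\s+", "-", line).split()
        freq = float(fields[0])
        if "-" in fields[1]:
            label = fields[1]                    # "res1-res2" in one column
        else:
            label = fields[1] + "-" + fields[2]  # "res1 res2" in two columns
        freqdict[label] = freq                   # any further columns are ignored
    return freqdict


lines = ["#freq label",
         "1 ALA30-GLU50",
         ".5 ASP31 - GLU51",
         ".1 ASP31 GLU50",
         "0.59 R389@G.H5.21    - L394@G.H5.26    348 353    0.59"]
freqdict = parse_freq_lines(lines)
print(freqdict)
```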

mdciao.utils.str_and_dict.freq_file2dict(ifile, defrag=None)

Read a file containing the frequencies (“freq”) and labels (“label”) of pre-computed contacts

Parameters
  • ifile (str) – Path to file, can be a .xlsx, .dat, .txt

  • defrag (str, default is None) – If passed a string, e.g “@”, the fragment information of the contact label will be deleted upon reading, so that R131@frag1 becomes R131. This is done by calling defrag_key internally

Returns

Dictionary keyed by labels and valued with frequencies, e.g. {“0-1”:.3, “0-2”:.1}

Return type

dict

mdciao.utils.str_and_dict.get_sorted_trajectories(trajectories)

Common parser for something that can be interpreted as a trajectory

Parameters

trajectories (can be one of these things:) –

Returns

  • for an input pattern, sorted trajectory filenames that match that pattern

  • for filename, one list containing that filename

  • for a list of filenames, a sorted list of filenames

mdciao.utils.str_and_dict.inform_about_trajectories(trajectories, only_show_first_and_last=False)

Return a string that informs about the trajectories

Parameters

trajectories (list of strings or mdtraj.Trajectory objects) –

Returns

listed_str – a string with the trajectory names separated by newlines

Return type

str

mdciao.utils.str_and_dict.intblocks_in_str(istr)

Return the integers that appear as contiguous blocks in strings

E.g. “GLU30@3.50-GDP396@frag1” returns [30,3,50,396,1]

Parameters

istr (string) –

Returns

ints

Return type

list
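One way to obtain such integer blocks is a regular expression over runs of digits. The sketch below reproduces the example above; the helper name `int_blocks` is hypothetical and the library may implement this differently.

```python
import re


def int_blocks(istr):
    """Return every contiguous run of digits in istr as an int (hypothetical helper)."""
    return [int(block) for block in re.findall(r"\d+", istr)]


ints = int_blocks("GLU30@3.50-GDP396@frag1")
print(ints)
```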

mdciao.utils.str_and_dict.iterate_and_inform_lambdas(ixtc, chunksize, stride=1, top=None)

Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report

Parameters
  • ixtc (str (filename) or mdtraj.Trajectory object) –

  • chunksize (int) – The trajectory will be iterated over in chunks of this many frames

  • stride (int, default is 1) – The stride with which to iterate over the trajectory

  • top (str (filename) or mdtraj.Topology) – If ixtc is a filename, the topology needed to read it

Returns

  • iterate, inform

  • iterate (lambda(ixtc)) – strided, chunked iterator over ixtc

  • inform (lambda(ixtc, traj_idx, chunk_idx, running_f)) – iterator that prints out streaming progress for every iteration

Note

The lambdas returned differ depending on the type of input, but their signature is the same, s.t. the user does not have to care about the input type in later use

mdciao.utils.str_and_dict.latex_mathmode(istr, enclose=True)

Prepend symbol words with “\” and protect non-symbol words with ‘\mathrm{}’

  • symbol words are things that can be interpreted by LaTeX in math mode, e.g. ‘\alpha’ or ‘\AA’

  • non-symbol words are everything else

Works “opposite” to replace4latex and for the moment it’s my (very bad) solution for latexifying contact-labels’ fragments as super indices where the fragments themselves contain sub-indices (e.g. GLU30^{beta_2AR})

>>> replace4latex("There's an alpha and a beta here, also C_200")
"There's an $\alpha$ and a $\beta$ here, also $C_{200}$"
>>> latex_mathmode("There's an alpha and a beta here, also C_200")
"$\\mathrm{There's an }\\alpha\\mathrm{ and a }\\beta\\mathrm{ here, also C_200}$"

Parameters
  • istr (string) –

  • enclose (bool, default is True) – Return string enclosed in dollar-signs: ‘$string$’. Use False for cases where the LaTeX math-mode is already active

Returns

istr

Return type

string

mdciao.utils.str_and_dict.latex_superscript_fragments(contact_label, defrag='@')

Format fragment descriptors as Latex math-mode superscripts

Thin wrapper around _latex_superscript_one_fragment that uses splitlabel

Parameters
  • contact_label (str) – contact label of any form, as long as two AAs are joined with the ‘-’ character

  • defrag (char, default is '@') – The character to divide residue and fragment label

Returns

contact_label

Return type

str

mdciao.utils.str_and_dict.lexsort_ctc_labels(ctc_labels, reverse=False, columns=[0, 1], sep='-') → tuple

Sort contact-labels in ascending order of resSeq using both columns

Wraps around numpy.lexsort with some string handling

It will also work with contact-labels consisting of only one residue, e.g. in the cases where the “anchor” has been deleted or the frequencies have been aggregated to per-residue frequencies

>>> import mdciao
>>> labels = ["ALA30@3.50-GLU50",
...           "HIS28-GLU50",
...           "ALA30-GLU20"]
>>> sorted_ctc_labels, order = mdciao.utils.str_and_dict.lexsort_ctc_labels(labels)
>>> sorted_ctc_labels
['HIS28-GLU50',
 'ALA30-GLU20',
 'ALA30@3.50-GLU50']

Parameters
  • ctc_labels (list or np.ndarray) – Strings describing the contact residues. It can also contain fragment information, which will be ignored when sorting but returned in sorted_ctc_labels

  • reverse (bool, default is False) – If True, sort in descending order, instead of ascending

  • columns (list) – The order of the columns, e.g. [0,1] means sort first by first column (idx 0), then by second column (idx 1).

  • sep (char, default is "-") – The character to use when separating the contact label into both residues

Returns

  • order (1D np.ndarray) – The indices of ctc_labels that sort it into sorted_ctc_labels

  • sorted_ctc_labels (list) – The sorted contact labels
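The sorting rule (first column's resSeq, then second column's resSeq, fragments ignored) can be pictured without numpy. The sketch below uses the built-in sorted instead of numpy.lexsort, so it is a stand-in for the technique and not the library's code; the helper name `sort_labels`, its return order, and the assumption that fragment names contain no separator are all specific to this example.

```python
import re


def sort_labels(ctc_labels, sep="-", defrag="@"):
    """Sort labels by the resSeq of each residue, ignoring fragments (hypothetical helper)."""
    def res_seqs(label):
        # Assumption: `sep` never appears inside a fragment name
        residues = [part.split(defrag)[0] for part in label.split(sep)]
        # The first block of digits of each residue is taken as its resSeq
        return tuple(int(re.search(r"\d+", residue).group()) for residue in residues)

    order = sorted(range(len(ctc_labels)), key=lambda idx: res_seqs(ctc_labels[idx]))
    return [ctc_labels[idx] for idx in order], order


labels = ["ALA30@3.50-GLU50", "HIS28-GLU50", "ALA30-GLU20"]
sorted_labels, order = sort_labels(labels)
print(sorted_labels, order)
```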

mdciao.utils.str_and_dict.match_dict_by_patterns(patterns_as_csv, index_dict, verbose=False)

Joins all the values in an input dictionary if their key matches some patterns. This method also allows for exclusions (similar to grep -v)

TODO: find out if regular expression re.findall() is better

Parameters
  • patterns_as_csv (str) – Comma-separated patterns to include or exclude, e.g.

    • “H*,-H8” will include all TMs but not H8

    • “G.S*” will include all beta-sheets

  • index_dict (dictionary) – It is expected to contain iterables of ints or floats or anything that is “joinable” via np.hstack. Typically, something like {“H1”:[0,1,…30], “ICL1”:[31,32,…40],…}

Returns

matching_keys, matching_values

Return type

list, array of joined values

mdciao.utils.str_and_dict.replace4latex(istr, sindex=['_', '^'], symbols=['alpha', 'beta', 'gamma', 'sigma', 'mu', 'aa'], enclose_pure_text=False)

Return a string where symbols and super/sub-indices have been prepared for LaTeX

One quirk: when sub- or superindexing, the following types get protected in curly brackets to avoid only sub/super indexing the first character:

  • fully numeric: C_{300}

  • fully alphabetical: GLY_{ACE}

  • containing dots: L394^{G.H.26}

BUT mixed beta_2AR are left unprotected:

>>> replace4latex("mdciao can alpha Sigma_2 beta_2AR ACE_GLY GLU30^3.50 no [frag1-WT] problem!")
'mdciao can $\\alpha$ $\\Sigma\\mathrm{_{2}}$ $\\beta\\mathrm{_2AR}$ $\\mathrm{{ACE}_{GLY}}$ $\\mathrm{GLU30^{3.50}}$ no [frag1-WT] problem!'

Parameters
  • istr (str) – The string to be prepared for LaTeX math mode. If a $ sign is already in istr, nothing will happen. If a word in istr contains the same sindex character more than once, it’ll be skipped (ask Knut: https://tex.stackexchange.com/questions/253080/why-am-i-getting-a-double-subscript-error)

  • sindex (list) – The characters that indicate super- and sub-indices

  • symbols (list) – The words that should be considered LaTeX symbols

Returns

lstr – The string with LaTeX-mathmode insertions

Return type

str

mdciao.utils.str_and_dict.replace_w_dict(input_str, exp_rep_dict)

Sequentially perform string replacements on a string using a dictionary

Parameters
  • input_str (str) –

  • exp_rep_dict (dictionary) – keys are expressions that will be replaced with values, i.e. key = key.replace(key1, val1) for key1, val1 etc

Returns

key – The input string after all replacements have been applied

Return type

str
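Because the replacements are applied one after the other, a later entry can act on the result of an earlier one. The sketch below illustrates this with a hypothetical helper, `replace_sequentially`, written in plain Python; it is not the library's code.

```python
def replace_sequentially(input_str, exp_rep_dict):
    """Apply str.replace once per dictionary entry, in insertion order (hypothetical helper)."""
    for expression, replacement in exp_rep_dict.items():
        input_str = input_str.replace(expression, replacement)
    return input_str


renamed = replace_sequentially("GLH30-LYS40", {"GLH": "GLU"})
chained = replace_sequentially("A", {"A": "B", "B": "C"})  # "A" -> "B" -> "C"
print(renamed, chained)
```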

mdciao.utils.str_and_dict.sort_dict_by_asc_values(idict, reverse=False)

Sort a dictionary by values

Parameters
  • idict (dict) – Input dictionary

  • reverse (bool, default is False) – Reverse the sorting order, i.e. sort by descending order of values

Returns

odict – idict with its keys sorted by its values

Return type

dict
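Since Python dictionaries preserve insertion order, sorting by value amounts to rebuilding the dictionary from its sorted items. A minimal sketch, with the hypothetical helper name `sort_by_values`:

```python
def sort_by_values(idict, reverse=False):
    """Return a new dict whose keys are ordered by their values (hypothetical helper)."""
    return dict(sorted(idict.items(), key=lambda item: item[1], reverse=reverse))


freqs = {"A-B": 0.5, "C-D": 0.1, "E-F": 0.9}
ascending = sort_by_values(freqs)
descending = sort_by_values(freqs, reverse=True)
print(list(ascending), list(descending))
```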

mdciao.utils.str_and_dict.splitlabel(label, sep='-', defrag='@')

Split a contact label. Analogous to label.split(sep) but more robust because fragment names can contain the separator character.

Parameters
  • label (str) –

    Can be of any of these forms:

    The fragment names can contain the separator, e.g. ‘res1@B2AR-CT-res2@Gprot’ is possible. Residue names cannot contain the separator.

    The method assumes that labels start with a residue, (see above), else you’ll get weird behaviour.

  • sep (char, default is "-") – The character that separates pairs of labels

  • defrag (char, default is "@") – The character that separates residues from their host fragment

Returns

split – A list equivalent to having used label.split(sep) but the separator is ignored in the fragment labels.

Return type

list

mdciao.utils.str_and_dict.sum_dict_per_residue(idict, sep)

Return a “per-residue” sum of values from a “per-residue-pair” keyed dictionary

Note: There is a closely related method in mdciao.contacts.ContactGroup that allows querying the freqs from the object already aggregated by residue. This function is for when the object is not accessible, e.g. because the freqs were loaded from a file

Parameters
  • idict (dict) – Keyed with contact labels like “res1@frag1-res2@3.50” etc

  • sep (char) – Character that separates fragments in the label

Returns

aggr – keyed with “res1@frag1” etc

Return type

dict
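The aggregation can be sketched as follows: every pair's value is added to both of its residues. The helper name `sum_per_residue` is hypothetical, and the plain str.split used here assumes that the separator does not occur inside fragment names, which the library handles more carefully.

```python
from collections import defaultdict


def sum_per_residue(idict, sep="-"):
    """Add each pair's value to both residues of the pair (hypothetical helper)."""
    aggr = defaultdict(float)
    for label, value in idict.items():
        # Assumption: `sep` only separates the two residues of the pair
        for residue in label.split(sep):
            aggr[residue] += value
    return dict(aggr)


pairs = {"res1@frag1-res2@3.50": 0.5, "res1@frag1-res3@3.51": 0.25}
aggr = sum_per_residue(pairs)
print(aggr)
```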

mdciao.utils.str_and_dict.unify_freq_dicts(freqs, exclude=None, key_separator='-', replacement_dict=None, defrag=None, per_residue=False, is_freq=True, val_missing=0, verbose=True)

Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.

Use key_separator for “GLU30-LYS40” == “LYS40-GLU30” to be True

Parameters
  • freqs (dictionary of dictionaries, e.g.:) –

    {A:{key1:valA1, key2:valA2, key3:valA3},

    B:{ key2:valB2, key3:valB3}}

  • key_separator (str, default is "-") – Specify how residues are separated in the contact label, eg. “GLU30-LYS40”. With this knowledge, the method can split the label before comparison so that “GLU30-LYS40” is considered equal to “LYS40-GLU30”. Use “”, “none” or None to differentiate. It will also be passed to defrag_key in case defrag is not None.

  • exclude (list, default is None) – keys containing these strings will be excluded. NOTE: This is not implemented yet, will raise an error

  • replacement_dict (dict, default is None) – all keys/strings will be subjected to replacements following this dictionary, s.t. “GLH30” is “GLU30” if replacement_dict is {“GLH”:”GLU”}. This way mutations and/or indexing can be accounted for in different setups

  • defrag (char, default is None) –

    If a char is given, e.g. “@”, anything after that character in the labels will be considered fragment information and ignored. This is only recommended for advanced users, usually the fragment information helps keep track of residue names in complex topologies:

    R201@frag1 and R201@frag3 will both be “R201”

  • per_residue (bool, default is False) – Aggregate interactions to their residues

  • is_freq (bool, default is True) – If the dictionaries actually contain frequencies or not. If not, some checks are omitted

  • val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)

  • verbose (bool, default is True) – Be verbose

Returns

unified_dict – A dictionary of dictionaries sharing keys: {A:{key1:valA1, key2:valA2, key3:valA3},

B:{key1:0, key2:valB2, key3:valB3}}

Return type

dictionary
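The core idea (order-insensitive keys plus filling in missing ones) can be sketched in plain Python. This covers only key_separator and val_missing and is not the library's implementation: the helper name `unify` is hypothetical, and sorting the two residues alphabetically to obtain a canonical key is an assumption of this sketch.

```python
def unify(freqs, key_separator="-", val_missing=0):
    """Give all sub-dictionaries the same keys (hypothetical helper)."""
    def canonical(label):
        # Assumption: alphabetical order of the residues defines the canonical key,
        # so that "LYS40-GLU30" and "GLU30-LYS40" end up identical
        return key_separator.join(sorted(label.split(key_separator)))

    canon = {system: {canonical(key): value for key, value in idict.items()}
             for system, idict in freqs.items()}
    all_keys = sorted(set().union(*canon.values()))
    return {system: {key: idict.get(key, val_missing) for key in all_keys}
            for system, idict in canon.items()}


freqs = {"A": {"GLU30-LYS40": 1.0, "ALA1-ALA2": 0.5},
         "B": {"LYS40-GLU30": 0.25}}
unified = unify(freqs)
print(unified)
```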