mdciao.utils.str_and_dict

Functions for manipulating strings and dictionaries, also a bit of IO.

Functions

`average_freq_dict`(freqs[, weights])	Average frequencies (or anything) over dictionaries.
`choose_options_descencing`(options[, fmt, ...])	Return the first entry that's acceptable according to some rule
`defrag_key`(key[, defrag, sep])	Remove fragment information from a contact label
`delete_exp_in_keys`(idict, exp[, sep])	Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g.
`delete_pattern_in_ctc_label`(pattern, label, sep)
`df_str_formatters`(df)	Return formatters for pandas.DataFrame.to_string
`fnmatch_ex`(patterns_as_csv, list_of_keys)	Match the keys in `list_of_keys` against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html)
`freq_ascii2dict`(ifile[, comment])	Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column).
`freq_file2dict`(ifile[, defrag])	Read a file containing the frequencies ("freq") and labels ("label") of pre-computed contacts
`get_trajectories_from_input`(trajectories)	Common parser for something that can be interpreted as a trajectory
`inform_about_trajectories`(trajectories[, ...])	Return a string that informs about the trajectories
`intblocks_in_str`(istr)	Return the integers that appear as contiguous blocks in strings.
`iterate_and_inform_lambdas`(ixtc, chunksize)	Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report
`latex_mathmode`(istr[, enclose])	Prepend symbol words with "\" and protect non-symbol words with '\mathrm{}'
`latex_superscript_fragments`(contact_label[, ...])	Format fragment descriptors as LaTeX math-mode superscripts
`lexsort_ctc_labels`(ctc_labels[, reverse, ...])	Sort contact-labels in ascending order of resSeq using both columns
`match_dict_by_patterns`(patterns_as_csv, ...)	Joins all the values in an input dictionary if their key matches some patterns.
`order_key`(key, sep)
`print_wrap`(text[, width, just_return_string])	Print the text wrapping the lines to a given character width
`replace4latex`(istr[, sindex, symbols, ...])	Return a string where symbols and super/sub-indices have been prepared for LaTeX
`replace_w_dict`(input_str, exp_rep_dict)	Sequentially perform string replacements on a string using a dictionary
`sort_dict_by_asc_values`(idict[, reverse])	Sort a dictionary by ascending values
`splitlabel`(label[, sep, defrag, dont_split])	Split a contact label.
`sum_dict_per_residue`(idict, sep)	Return a "per-residue" sum of values from a "per-residue-pair" keyed dictionary
`unify_freq_dicts`(freqs[, exclude, ...])	Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.

Classes

FilenameGenerator(output_desc, ...)

Generate per project filenames when you need them

class mdciao.utils.str_and_dict.FilenameGenerator(output_desc, ctc_cutoff_Ang, output_dir, graphic_ext, table_ext, graphic_dpi, t_unit)

Generate per project filenames when you need them

This is a WIP to consolidate all filenaming in one place, s.t. all sanitizing and project-specific naming operations happen here and not in the cli methods

A named tuple would’ve been enough, but we need some methods for dynamic naming (e.g. per-residue or per-traj)

mdciao.utils.str_and_dict.average_freq_dict(freqs, weights=None, **unify_freq_dicts_kwargs)

Average frequencies (or anything) over dictionaries.

Typically, the input freqs are keyed first by system, then by contact label, e.g. {“T300”:{“GDP-R201”:1.0}, “T320”:{“GDP-R201”:.25}, “MUT”:{“GDP-L201”:25}}

The input data need not be unified; the method calls unify_freq_dicts internally. In the example above, you have to call it with the arg replacement_dict={“L201”:“R201”} so that it can understand that mutation when unifying.

Parameters:

freqs (dict of dicts) – The dictionary containing the frequency dictionaries
weights (dict, default is None) – relative weights of each dictionary
unify_freq_dicts_kwargs (Optional keyword args for unify_freq_dicts) – as listed below

Other Parameters:

key_separator (str, default is “-”) – Specify how residues are separated in the contact label, e.g. “GLU30-LYS40”. With this knowledge, the method can split the label before comparison so that “GLU30-LYS40” is considered equal to “LYS40-GLU30”. Use “”, “none” or None to differentiate. It will also be passed to defrag_key in case defrag is not None.
exclude (list, default is None) – keys containing these strings will be excluded. NOTE: This is not implemented yet, will raise an error
replacement_dict (dict, default is None) – all keys/strings will be subjected to replacements following this dictionary, s.t. “GLH30” is “GLU30” if replacement_dict is {“GLH”:“GLU”}. This way, mutations and/or indexing can be accounted for in different setups
defrag (char, default is None) – If a char is given, e.g. “@”, anything after that character in the labels will be considered fragment information and ignored, s.t. R201@frag1 and R201@frag3 will both be “R201”. This is only recommended for advanced users; usually the fragment information helps keep track of residue names in complex topologies.
per_residue (bool, default is False) – Aggregate interactions to their residues
is_freq (bool, default is True) – If the dictionaries actually contain frequencies or not. If not, some checks are omitted
val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)
verbose (bool, default is True) – Be verbose

Returns:

averaged_dict – an averaged dictionary keyed only with the

Return type:

dict
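
The averaging described above can be illustrated with a minimal pure-Python sketch. It is not mdciao's implementation: the function name `average_over_systems` is hypothetical, keys are assumed to be already comparable across systems (no label unification or replacements are attempted), missing keys count as 0, and weights are assumed to be normalized by their sum.

```python
def average_over_systems(freqs, weights=None):
    """Weighted average of per-system dictionaries (illustrative sketch only)."""
    systems = list(freqs)
    if weights is None:
        # Assumption of this sketch: equal weights when none are given
        weights = {system: 1.0 for system in systems}
    total = sum(weights[system] for system in systems)
    # Collect every contact label that appears in any system
    all_keys = sorted(set().union(*freqs.values()))
    return {
        key: sum(weights[s] * freqs[s].get(key, 0) for s in systems) / total
        for key in all_keys
    }


freqs = {"T300": {"GDP-R201": 1.0}, "T320": {"GDP-R201": 0.5}}
print(average_over_systems(freqs))
```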

mdciao.utils.str_and_dict.choose_options_descencing(options, fmt='%s', dont_accept=['none', 'na'])

Return the first entry that’s acceptable according to some rule

If none is found, “” is returned

Parameters:

options (list)
fmt (str, default is “%s”) – You can specify a different format here. Will only apply in case something is returned
dont_accept (list) – Move down the list if current item is one of these

Returns:

best – Either the best entry in options or “” if no option was found

Return type:

str
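
As an illustration of the rule described above, here is a minimal sketch. It is not mdciao's code: the name `first_acceptable` is hypothetical, and the case-insensitive comparison against `dont_accept` is an assumption of the sketch.

```python
def first_acceptable(options, fmt="%s", dont_accept=("none", "na")):
    """Return the first acceptable option formatted with fmt, or "" if none is."""
    for option in options:
        # Assumption: options are compared as lower-cased strings
        if str(option).lower() not in dont_accept:
            return fmt % option
    return ""


print(first_acceptable(["None", "NA", "3.50"], fmt="@%s"))
```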

mdciao.utils.str_and_dict.defrag_key(key, defrag='@', sep='-')

Remove fragment information from a contact label

Parameters:

key (str) – Contact label with some sort of pair information, e.g. R1@frag1 -E2@frag2 -> R1-E2
defrag (char, default is “@”) – Character that indicates the beginning of the fragment
sep (char, default is “-”) – Character that indicates the separation between first and second residue of the pair
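
The effect of removing fragment information can be sketched as follows. This is an illustrative stand-in, not mdciao's implementation: `strip_fragments` is a hypothetical name, and the sketch assumes that fragment names do not themselves contain the separator.

```python
def strip_fragments(key, defrag="@", sep="-"):
    """Drop everything from the defrag character onwards in each residue label."""
    # Assumption: splitting on sep is safe, i.e. no fragment name contains sep
    residues = [part.split(defrag)[0].strip() for part in key.split(sep)]
    return sep.join(residues)


print(strip_fragments("R1@frag1 -E2@frag2"))
```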

mdciao.utils.str_and_dict.delete_exp_in_keys(idict, exp, sep='-')

Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. “GLU30-ARG40”, deletes the segment containing the input expression, exp

Will fail if not all keys have the expression to be deleted

Parameters:

idict (dictionary)
exp (str)
sep (str, default is “-”)

Returns:

dict – dictionary with the same values but the keys lack the segment containing exp
dhk (list) – List with the deleted half-keys
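
A minimal sketch of the behaviour described above, under stated assumptions: `drop_segment` is a hypothetical name, keys are assumed to contain exactly two segments, and the sketch raises a ValueError when a key does not contain the expression exactly once (the real method's failure mode may differ).

```python
def drop_segment(idict, exp, sep="-"):
    """Delete the key segment containing exp; return the new dict and the deleted segments."""
    out, deleted = {}, []
    for key, val in idict.items():
        segments = key.split(sep)
        gone = [seg for seg in segments if exp in seg]
        keep = [seg for seg in segments if exp not in seg]
        if len(gone) != 1:
            # Sketch-only choice: fail loudly if exp is absent or ambiguous
            raise ValueError(f"'{exp}' must appear in exactly one segment of '{key}'")
        out[sep.join(keep)] = val
        deleted.append(gone[0])
    return out, deleted


print(drop_segment({"GLU30-ARG40": 1.0, "ARG40-LYS50": 0.5}, "ARG40"))
```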

mdciao.utils.str_and_dict.df_str_formatters(df)

Return formatters for pandas.DataFrame.to_string

In principle, this should be solved by https://github.com/pandas-dev/pandas/issues/13032, but I cannot get it to work

Parameters:

df (DataFrame)

Returns:

formatters – Keyed with df-keys and valued with lambdas s.t. formatters[key](istr)=formatted_istr

Return type:

dict

mdciao.utils.str_and_dict.fnmatch_ex(patterns_as_csv, list_of_keys)

Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html)

This method also allows for exclusions (similar to grep -v)

TODO: find out if regular expression re.findall() is better

Uses fnmatch under the hood

Parameters:

patterns_as_csv (str) – Patterns to include or exclude, separated by commas, e.g.
- “H*,-H8” will include all TMs but not H8
- “G.S*” will include all beta-sheets
list_of_keys (list) – Keys against which to match the patterns, e.g. [“H1”, “ICL1”, “H2”, …, “ICL3”, “H6”, “H7”, “H8”]

Returns:

matching_keys

Return type:

list
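
The include/exclude matching described above can be sketched with the standard-library fnmatch module. This is only an illustration, not mdciao's code: `match_with_exclusions` is a hypothetical name, and treating a leading “-” as the exclusion marker follows the “H*,-H8” example.

```python
from fnmatch import fnmatchcase


def match_with_exclusions(patterns_as_csv, list_of_keys):
    """Return the keys matching any include-pattern and no exclude-pattern."""
    patterns = [p.strip() for p in patterns_as_csv.split(",") if p.strip()]
    # Patterns prefixed with "-" are exclusions, the rest are inclusions
    include = [p for p in patterns if not p.startswith("-")]
    exclude = [p[1:] for p in patterns if p.startswith("-")]
    return [
        key for key in list_of_keys
        if any(fnmatchcase(key, p) for p in include)
        and not any(fnmatchcase(key, p) for p in exclude)
    ]


keys = ["H1", "ICL1", "H2", "ICL3", "H6", "H7", "H8"]
print(match_with_exclusions("H*,-H8", keys))
```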

mdciao.utils.str_and_dict.freq_ascii2dict(ifile, comment='#')

Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column). Columns are separated by tabs or spaces.

Contact labels have to come after the frequency in the form of “res1 res2”, “res1-res2” or “res1 - res2”.

Columns other than the frequencies and the residue labels are ignored.

Examples

File produced by mdciao:

>>> #freq              label              residue idxs  sum
>>> 0.59 R389@G.H5.21    - L394@G.H5.26    348 353    0.59
>>> 0.46 L394@G.H5.26    - K270@6.32x32    353 972    1.05
>>> 0.34 L388@G.H5.20    - L394@G.H5.26    347 353    1.39
>>> 0.32 L394@G.H5.26    - L230@5.69x69    353 957    1.71
>>> 0.04 R385@G.H5.17    - L394@G.H5.26    344 353    1.75

Minimal file with mixed labeling

>>> 1 ALA30-GLU50
>>> .5 ASP31 - GLU51
>>> .1 ASP31 GLU50

TODO use pandas to allow more flex, not needed for the moment

Parameters:

ifile (str) – The filename to be read
comment (str, default is ‘#’) – Any line starting with any of these characters will be ignored

Returns:

freqdict – Keys are “res1-res2” (regardless of input) and values are freqs

Return type:

dictionary
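
To make the accepted line formats concrete, here is a minimal parsing sketch. It is not mdciao's parser: `parse_freq_lines` is a hypothetical name that works on an iterable of lines rather than a filename, and it assumes residue and fragment labels contain no “-” themselves.

```python
def parse_freq_lines(lines, comment="#"):
    """Parse 'freq res1 res2', 'freq res1-res2' or 'freq res1 - res2' lines."""
    freqdict = {}
    for line in lines:
        line = line.strip()
        if not line or line[0] in comment:
            continue  # skip empty lines and comment lines
        # Assumption: "-" only ever joins the two residues, so treat it as whitespace
        fields = line.replace("-", " ").split()
        freq, res1, res2 = float(fields[0]), fields[1], fields[2]
        # Any further columns (indices, sums) are ignored
        freqdict[f"{res1}-{res2}"] = freq
    return freqdict


example = [
    "#freq label",
    "1 ALA30-GLU50",
    ".5 ASP31 - GLU51",
    ".1 ASP31 GLU50",
]
print(parse_freq_lines(example))
```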

mdciao.utils.str_and_dict.freq_file2dict(ifile, defrag=None)

Read a file containing the frequencies (“freq”) and labels (“label”) of pre-computed contacts

Parameters:

ifile (str) – Path to file, can be a .xlsx, .dat, .txt
defrag (str, default is None) – If passed a string, e.g. “@”, the fragment information of the contact label will be deleted upon reading, so that R131@frag1 becomes R131. This is done by calling defrag_key internally

Returns:

Keyed by labels and valued with frequencies, e.g. {“0-1”:.3, “0-2”:.1}

Return type:

dict

mdciao.utils.str_and_dict.get_trajectories_from_input(trajectories)

Common parser for something that can be interpreted as a trajectory

Parameters:

trajectories (can be one of these things:) –

- pattern, e.g. “*.ext”
- one single string containing a filename
- one single mdtraj.Trajectory object
- one list containing
  - just filenames
  - just mdtraj.Trajectory objects
  - a mix of filenames and mdtraj.Trajectory objects

Returns:

outtrajs – A list of trajectories. Depending on the input, this list can be:
- for an input pattern: sorted trajectory filenames that match that pattern
- for a filename or an mdtraj.Trajectory: one list containing that filename or mdtraj.Trajectory object
- for a list: that same list (i.e. nothing happens)

Return type:

list

mdciao.utils.str_and_dict.inform_about_trajectories(trajectories, only_show_first_and_last=False)

Return a string that informs about the trajectories

Parameters:

trajectories (list of strings or mdtraj.Trajectory objects)

Returns:

listed_str – a string with the trajectory names separated by newlines

Return type:

str

mdciao.utils.str_and_dict.intblocks_in_str(istr)

Return the integers that appear as contiguous blocks in strings.

E.g. “GLU30@3.50 -GDP396@frag1” returns [30,3,50,396,1]

Will raise a ValueError if istr doesn’t contain any integers

Related, but not the same as int_from_AA_code

Parameters:

istr (string)

Returns:

ints – The integers found in istr. A ValueError is raised if istr doesn’t have any integers in it

Return type:

list
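
The extraction of contiguous integer blocks can be sketched with a regular expression. This is an illustrative equivalent, not necessarily how mdciao implements it; the name `int_blocks` is hypothetical.

```python
import re


def int_blocks(istr):
    """Return the integers that appear as contiguous digit blocks in istr."""
    blocks = re.findall(r"\d+", istr)
    if not blocks:
        raise ValueError(f"'{istr}' does not contain any integers")
    return [int(block) for block in blocks]


print(int_blocks("GLU30@3.50 -GDP396@frag1"))  # [30, 3, 50, 396, 1]
```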

mdciao.utils.str_and_dict.iterate_and_inform_lambdas(ixtc, chunksize, stride=1, top=None, nchars_fname=None)

Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report

Parameters:

ixtc (str (filename) or mdtraj.Trajectory object)
chunksize (int) – The trajectory will be iterated over in chunks of this many frames
stride (int, default is 1) – The stride with which to iterate over the trajectory
top (str (filename) or mdtraj.Topology) – If ixtc is a filename, the topology needed to read it
nchars_fname (int, default is None) – The number of characters for the filename field. By default it adjusts automatically, but it can be fixed here in case you want to use the same field width for many files.

Returns:

iterate, inform
iterate (lambda(ixtc)) – strided, chunked iterator over ixtc
inform (lambda(ixtc, traj_idx, chunk_idx, running_f)) – iterator that returns a string informing on streaming progress for every iteration

Note

The lambdas returned differ depending on the type of input, but their signature is the same, s.t. the user does not have to care about the input type when using them later

mdciao.utils.str_and_dict.latex_mathmode(istr, enclose=True)

Prepend symbol words with “\” and protect non-symbol words with ‘\mathrm{}’

symbol words are things that can be interpreted by LaTeX in math mode, e.g. ‘\alpha’ or ‘\AA’
non-symbol words are everything else

Works “opposite” to replace4latex and, for the moment, it’s my (very bad) solution for latexifying contact-labels’ fragments as super indices where the fragments themselves contain sub-indices (e.g. GLU30^{beta_2AR})

>>> replace4latex("There's an alpha and a beta here, also C_200")
"There's an $\alpha$ and a $\beta$ here, also $C_{200}$"

>>> latex_mathmode("There's an alpha and a beta here, also C_200")
"$\\mathrm{There's an }\\alpha\\mathrm{ and a }\\beta\\mathrm{ here, also C_200}$"

Parameters:

istr (string)
enclose (bool, default is True) – Return string enclosed in dollar-signs: ‘$string$’ Use False for cases where the LaTeX math-mode is already active

Returns:

istr

Return type:

string

mdciao.utils.str_and_dict.latex_superscript_fragments(contact_label, defrag='@')

Format fragment descriptors as LaTeX math-mode superscripts

Thin wrapper around _latex_superscript_one_fragment with splitlabel

Parameters:

contact_label (str) – contact label of any form, as long as two AAs are joined with the ‘-’ character
defrag (char, default is ‘@’) – The character to divide residue and fragment label

Returns:

contact_label

Return type:

str

mdciao.utils.str_and_dict.lexsort_ctc_labels(ctc_labels, reverse=False, columns=[0, 1], sep='-') → tuple

Sort contact-labels in ascending order of resSeq using both columns

Wraps around numpy.lexsort with some string handling.

It will also work with contact-labels consisting of only one residue, e.g. in the cases where the “anchor” has been deleted or the frequencies have been aggregated to per-residue frequencies

>>> import mdciao
>>> labels = ["ALA30@3.50-GLU50",
...           "HIS28-GLU50",
...           "ALA30-GLU20"]
>>> sorted_labels, order = mdciao.utils.str_and_dict.lexsort_ctc_labels(labels)
>>> sorted_labels
['HIS28-GLU50',
 'ALA30-GLU20',
 'ALA30@3.50-GLU50']

Parameters:

ctc_labels (list or np.ndarray) – Strings describing the contact residues. It can contain also fragment information, which will be ignored when sorting but returned in sorted_ctc_labels
reverse (bool, default is False) – If True, sort in descending order, instead of ascending
columns (list) – The order of the columns, e.g. [0,1] means sort first by first column (idx 0), then by second column (idx 1).
sep (char, default is “-”) – The character to use when separating the contact label into both residues

Returns:

sorted_ctc_labels (list) – The sorted contact labels
order (1D np.ndarray) – The indices of ctc_labels that sort it into sorted_ctc_labels

mdciao.utils.str_and_dict.match_dict_by_patterns(patterns_as_csv, index_dict, verbose=False)

Joins all the values in an input dictionary if their key matches some patterns. This method also allows for exclusions (similar to grep -v)

TODO: find out if regular expression re.findall() is better

Parameters:

patterns_as_csv (str) – Comma-separated patterns to include or exclude, e.g.
- “H*,-H8” will include all TMs but not H8
- “G.S*” will include all beta-sheets
index_dict (dictionary) – It is expected to contain iterables of ints or floats or anything that is “joinable” via np.hstack. Typically, something like {“H1”:[0,1,…30], “ICL1”:[31,32,…40],…}

Returns:

matching_keys, matching_values

Return type:

list, array of joined values

mdciao.utils.str_and_dict.print_wrap(text, width=100, just_return_string=False, **kwargs)

Print the text wrapping the lines to a given character width

Parameters:

text (str) – The text to wrap
width (int, default is 100) – The maximum number of characters per line
just_return_string (bool, default is False) – Instead of printing, just return the string
kwargs (dict, optional) – Keyword arguments for print()

mdciao.utils.str_and_dict.replace4latex(istr, sindex=['_', '^'], symbols=['alpha', 'beta', 'gamma', 'sigma', 'mu', 'aa'], enclose_pure_text=False)

Return a string where symbols and super/sub-indices have been prepared for LaTeX

One quirk: when sub- or superindexing, the following types get protected in curly brackets to avoid only sub/super indexing the first character:

fully numeric: C_{300}

fully alphabetical: GLY_{ACE}

containing dots: L394^{G.H.26}

BUT mixed beta_2AR are left unprotected:

>>> replace4latex("mdciao can alpha Sigma_2 beta_2AR ACE_GLY GLU30^3.50 no [frag1-WT] problem!")
'mdciao can $\\alpha$ $\\Sigma\\mathrm{_{2}}$ $\\beta\\mathrm{_2AR}$ $\\mathrm{{ACE}_{GLY}}$ $\\mathrm{GLU30^{3.50}}$ no [frag1-WT] problem!'

Parameters:

istr (str) – The string to be prepared for LaTeX math mode. If a $ sign is already in istr, nothing will happen. If a word in istr contains the same sindex character more than once, it’ll be skipped (see https://tex.stackexchange.com/questions/253080/why-am-i-getting-a-double-subscript-error)
sindex (list) – The characters that indicate super- and sub-indices
symbols (list) – The words that should be considered LaTeX symbols

Returns:

lstr – The string with LaTeX math-mode insertions

Return type:

str

mdciao.utils.str_and_dict.replace_w_dict(input_str, exp_rep_dict)

Sequentially perform string replacements on a string using a dictionary

Parameters:

input_str (str)
exp_rep_dict (dictionary) – keys are expressions that will be replaced with their values, i.e. input_str = input_str.replace(key1, val1) for key1, val1 etc

Returns:

The input string after all replacements

Return type:

str
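
Sequential replacement with a dictionary can be sketched in a few lines. The name `replace_sequentially` is hypothetical and the sketch is only meant to illustrate the idea; note that, because replacements are applied one after another in the dictionary's insertion order, a later pair can act on the result of an earlier one.

```python
def replace_sequentially(input_str, exp_rep_dict):
    """Apply str.replace once per (expression, replacement) pair, in dict order."""
    for expression, replacement in exp_rep_dict.items():
        input_str = input_str.replace(expression, replacement)
    return input_str


print(replace_sequentially("GLH30-LYS40", {"GLH": "GLU"}))  # GLU30-LYS40
```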

mdciao.utils.str_and_dict.sort_dict_by_asc_values(idict, reverse=False)

Sort a dictionary by ascending values

Parameters:

idict (dict) – Input dictionary
reverse (bool, default is False) – Reverse the sorting order, i.e. sort by descending order of values

Returns:

odict – The input dictionary, with its keys sorted by their values

Return type:

dict
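
Since Python dictionaries preserve insertion order, sorting by value can be sketched as below. `sort_by_values` is a hypothetical name and this is not necessarily how mdciao does it.

```python
def sort_by_values(idict, reverse=False):
    """Return a new dict whose items are ordered by ascending (or descending) value."""
    return dict(sorted(idict.items(), key=lambda item: item[1], reverse=reverse))


print(sort_by_values({"a": 0.5, "b": 0.1, "c": 0.9}))
```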

mdciao.utils.str_and_dict.splitlabel(label, sep='-', defrag='@', dont_split=None)

Split a contact label. Analogous to label.split(sep) but more robust because fragment names can contain the separator character.

Parameters:

label (str) –
Can be any of these forms:
- res1
- res1@frag1
- res1@frag1-res2
- res1@frag1 -res2@frag2
- res1-res2@frag2
- res1-res2
The fragment names can contain the separator, e.g. ‘res1@B2AR-CT -res2@Gprot’ is possible. Residue names cannot contain the separator.

The method assumes that labels start with a residue, (see above), else you’ll get weird behaviour.
sep (char, default is “-”) – The character that separates pairs of labels
defrag (char, default is “@”) – The character that separates residues from their host fragment
dont_split (list, default is None) – The strings in this list won’t be separated even if they contain the separator. If the user knows that residue names like the ion “Cl-” or the ligand “DRG-1” might come up, they can “protect” them from splitting via this list.

Returns:

split – A list equivalent to having used label.split(sep) but the separator is ignored in the fragment labels.

Return type:

list

mdciao.utils.str_and_dict.sum_dict_per_residue(idict, sep)

Return a “per-residue” sum of values from a “per-residue-pair” keyed dictionary

Note: There is a closely related method in mdciao.contacts.ContactGroup that allows querying the freqs from the object already aggregated by residue. This function is for when that object is not accessible, e.g. because the freqs were loaded from a file

Parameters:

idict (dict) – Keyed with contact labels like “res1@frag1 -res2@3.50” etc
sep (char) – Character that separates fragments in the label

Returns:

aggr – keyed with “res1@frag1” etc

Return type:

dict
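
A minimal sketch of the per-residue aggregation, under stated assumptions: `sum_per_residue` is a hypothetical name, and a plain split on the separator is used, so fragment names are assumed not to contain it (the real method may be more robust).

```python
def sum_per_residue(idict, sep="-"):
    """Sum the values of a pair-keyed dict onto each residue taking part in the pair."""
    aggr = {}
    for label, value in idict.items():
        # Assumption: sep only appears between the two residues of the pair
        for residue in label.split(sep):
            residue = residue.strip()
            aggr[residue] = aggr.get(residue, 0) + value
    return aggr


print(sum_per_residue({"A@f1 -B@3.50": 0.5, "A@f1 -C": 0.25}))
```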

mdciao.utils.str_and_dict.unify_freq_dicts(freqs, exclude=None, key_separator='-', replacement_dict=None, defrag=None, per_residue=False, is_freq=True, val_missing=0, verbose=True)

Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.

Use key_separator for “GLU30-LYS40” == “LYS40-GLU30” to be True

Parameters:

freqs (dictionary of dictionaries, e.g.:) –

{A:{key1:valA1, key2:valA2, key3:valA3},
B:{ key2:valB2, key3:valB3}}
key_separator (str, default is “-”) – Specify how residues are separated in the contact label, e.g. “GLU30-LYS40”. With this knowledge, the method can split the label before comparison so that “GLU30-LYS40” is considered equal to “LYS40-GLU30”. Use “”, “none” or None to differentiate. It will also be passed to defrag_key in case defrag is not None.
exclude (list, default is None) – keys containing these strings will be excluded. NOTE: This is not implemented yet, will raise an error
replacement_dict (dict, default is None) – all keys/strings will be subjected to replacements following this dictionary, s.t. “GLH30” is “GLU30” if replacement_dict is {“GLH”:“GLU”}. This way, mutations and/or indexing can be accounted for in different setups
defrag (char, default is None) – If a char is given, e.g. “@”, anything after that character in the labels will be considered fragment information and ignored, s.t. R201@frag1 and R201@frag3 will both be “R201”. This is only recommended for advanced users; usually the fragment information helps keep track of residue names in complex topologies.
per_residue (bool, default is False) – Aggregate interactions to their residues
is_freq (bool, default is True) – If the dictionaries actually contain frequencies or not. If not, some checks are omitted
val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)
verbose (bool, default is True) – Be verbose

Returns:

unified_dict – A dictionary of dictionaries sharing keys:

{A:{key1:valA1, key2:valA2, key3:valA3},
B:{key1:0, key2:valB2, key3:valB3}}

Return type:

dictionary
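
The core of the key unification can be sketched as follows. This is an illustration only, not mdciao's implementation: `unify_keys` is a hypothetical name, replacements, defragmentation and per-residue aggregation are left out, and making labels order-insensitive by sorting their residues alphabetically is an assumption of the sketch.

```python
def unify_keys(freqs, key_separator="-", val_missing=0):
    """Give all sub-dictionaries the same keys, filling gaps with val_missing."""

    def canonical(key):
        if key_separator in ("", "none", None):
            return key  # labels are compared as-is
        # Assumption: alphabetical order makes "LYS40-GLU30" equal "GLU30-LYS40"
        parts = sorted(part.strip() for part in key.split(key_separator))
        return key_separator.join(parts)

    canon = {name: {canonical(k): v for k, v in sub.items()}
             for name, sub in freqs.items()}
    all_keys = sorted(set().union(*canon.values()))
    return {name: {k: sub.get(k, val_missing) for k in all_keys}
            for name, sub in canon.items()}


freqs = {"A": {"GLU30-LYS40": 1.0, "ALA1-ALA2": 0.5},
         "B": {"LYS40-GLU30": 0.25}}
print(unify_keys(freqs))
```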