mdciao.utils.str_and_dict
Functions for manipulating strings and dictionaries, also a bit of IO.
Functions
average_freq_dict | Average frequencies (or anything) over dictionaries.
choose_options_descencing | Return the first entry that's acceptable according to some rule
defrag_key | Remove fragment information from a contact label
delete_exp_in_keys | Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. "GLU30-ARG40", deletes the segment containing the input expression, exp
df_str_formatters | Return formatters for pandas.DataFrame.to_string
fnmatch_ex | Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching
freq_ascii2dict | Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column).
freq_file2dict | Read a file containing the frequencies ("freq") and labels ("label") of pre-computed contacts
get_sorted_trajectories | Common parser for something that can be interpreted as a trajectory
inform_about_trajectories | Return a string that informs about the trajectories
intblocks_in_str | Return the integers that appear as contiguous blocks in strings
iterate_and_inform_lambdas | Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report
latex_mathmode | Prepend symbol words with "\" and protect non-symbol words with '\mathrm{}'
latex_superscript_fragments | Format fragment descriptors as LaTeX math-mode superscripts
lexsort_ctc_labels | Sort contact-labels in ascending order of resSeq using both columns
match_dict_by_patterns | Joins all the values in an input dictionary if their key matches some patterns.
replace4latex | Return a string where symbols and super/sub-indices have been prepared for LaTeX
replace_w_dict | Sequentially perform string replacements on a string using a dictionary
sort_dict_by_asc_values | Sort a dictionary by values
splitlabel | Split a contact label.
sum_dict_per_residue | Return a "per-residue" sum of values from a "per-residue-pair" keyed dictionary
unify_freq_dicts | Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.
Classes
FilenameGenerator | Generate per project filenames when you need them
-
class mdciao.utils.str_and_dict.FilenameGenerator(output_desc, ctc_cutoff_Ang, output_dir, graphic_ext, table_ext, graphic_dpi, t_unit)
Generate per project filenames when you need them
This is a work in progress to consolidate all file naming in one place, so that all sanitizing and project-specific naming operations happen here and not in the CLI methods.
A named tuple would have been enough, but some methods are needed for dynamic naming (e.g. per-residue or per-trajectory).
-
mdciao.utils.str_and_dict.average_freq_dict(freqs, weights=None, **unify_freq_dicts_kwargs)
Average frequencies (or anything) over dictionaries.
Typically, the input freqs are keyed first by system, then by contact label, e.g. {"T300":{"GDP-R201":1.0}, "T320":{"GDP-R201":.25}, "MUT":{"GDP-L201":25}}
The input data need not be unified; the method calls unify_freq_dicts internally. In the example above, you have to call it with the argument replacement_dict={"L201":"R201"} so that it can understand that mutation when unifying.
- Parameters
freqs (dict of dicts) – The dictionaries containing frequency dictionaries
weights (dict, default is None) – Relative weights of each dictionary
unify_freq_dicts_kwargs – Keyword arguments passed on to unify_freq_dicts
- Returns
averaged_dict – An averaged dictionary keyed only with the contact labels
- Return type
dict
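The averaging step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not mdciao's implementation: the helper name `average_unified_sketch` is hypothetical, the input is assumed to be already unified (all sub-dictionaries share their keys), and weights are assumed to be normalized by their sum.

```python
def average_unified_sketch(freqs, weights=None):
    """Weighted mean per contact label over already-unified sub-dictionaries.

    Hypothetical helper for illustration only; it assumes that all
    sub-dictionaries share their keys and that weights are relative.
    """
    systems = list(freqs)
    if weights is None:
        # Assumption: without weights, every system counts the same
        weights = {system: 1.0 for system in systems}
    total = sum(weights[system] for system in systems)
    labels = freqs[systems[0]].keys()
    return {label: sum(freqs[s][label] * weights[s] for s in systems) / total
            for label in labels}


unified = {"T300": {"GDP-R201": 1.0, "GDP-E50": 0.5},
           "T320": {"GDP-R201": 0.25, "GDP-E50": 0.0}}
averaged = average_unified_sketch(unified)
weighted = average_unified_sketch(unified, weights={"T300": 3, "T320": 1})
```

Here `averaged` holds the plain mean per label, while `weighted` lets the "T300" system count three times as much as "T320".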
-
mdciao.utils.str_and_dict.choose_options_descencing(options, fmt='%s', dont_accept=['none', 'na'])
Return the first entry that's acceptable according to some rule
If none is found, "" is returned
- Parameters
options (list) –
fmt (str, default is '%s') – You can specify a different format here. Will only apply in case something is returned
dont_accept (list) – Move down the list if the current item is one of these
- Returns
best – Either the best entry in options or "" if no option was found
- Return type
str
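A minimal sketch of the selection rule described above, written in plain Python. The helper name `first_acceptable_sketch` is hypothetical, and the case-insensitive comparison against `dont_accept` is an assumption rather than documented behaviour.

```python
def first_acceptable_sketch(options, fmt="%s", dont_accept=("none", "na")):
    """Return the first acceptable option formatted with fmt, or "" if none.

    Hypothetical re-implementation for illustration; comparing the
    lower-cased string form of each option is an assumption.
    """
    for option in options:
        if str(option).lower() not in dont_accept:
            return fmt % option
    return ""


best = first_acceptable_sketch(["None", "NA", "3.50"], fmt="@%s")
nothing = first_acceptable_sketch(["none", "na"])
```

Note that `fmt` is applied only to the entry that gets returned; the empty fallback is not formatted.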
-
mdciao.utils.str_and_dict.defrag_key(key, defrag='@', sep='-')
Remove fragment information from a contact label
- Parameters
key (str) – Contact label with some sort of pair information, e.g. R1@frag1-E2@frag2 -> R1-E2
defrag (char, default is "@") – Character that indicates the beginning of the fragment
sep (char, default is "-") – Character that indicates the separation between first and second residue of the pair
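The transformation R1@frag1-E2@frag2 -> R1-E2 can be sketched as follows. The helper name `defrag_key_sketch` is hypothetical, and the sketch assumes that fragment names do not themselves contain the separator.

```python
def defrag_key_sketch(key, defrag="@", sep="-"):
    """Drop everything after `defrag` in each residue of a contact label.

    Hypothetical helper for illustration; it assumes that fragment names
    do not contain `sep`.
    """
    return sep.join(residue.split(defrag)[0] for residue in key.split(sep))


clean = defrag_key_sketch("R1@frag1-E2@frag2")
```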
-
mdciao.utils.str_and_dict.delete_exp_in_keys(idict, exp, sep='-')
Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. "GLU30-ARG40", deletes the segment containing the input expression, exp.
Will fail if not all keys have the expression to be deleted.
- Parameters
idict (dictionary) –
exp (str) –
sep (str, default is "-") –
- Returns
dict – Dictionary with the same values, but the keys lack the segment containing exp
dhk (list) – List with the deleted half-keys
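As an illustration of the behaviour described above, the following sketch removes the half-key containing the expression and collects the deleted half-keys. The helper name `delete_exp_in_keys_sketch` and the choice of `ValueError` for the failure case are assumptions.

```python
def delete_exp_in_keys_sketch(idict, exp, sep="-"):
    """Remove the half-key containing `exp` from every key.

    Hypothetical helper for illustration; it raises ValueError unless
    exactly one segment of each key contains `exp` (an assumption).
    """
    odict, deleted = {}, []
    for key, value in idict.items():
        segments = key.split(sep)
        hits = [segment for segment in segments if exp in segment]
        if len(hits) != 1:
            raise ValueError(f"Expected exactly one segment with {exp!r} in {key!r}")
        kept = [segment for segment in segments if exp not in segment]
        odict[sep.join(kept)] = value
        deleted.append(hits[0])
    return odict, deleted


odict, dhk = delete_exp_in_keys_sketch(
    {"GLU30-ARG40": 0.5, "ARG40-LYS50": 0.25}, "ARG40")
```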
-
mdciao.utils.str_and_dict.df_str_formatters(df)
Return formatters for pandas.DataFrame.to_string
In principle, this should be solved by https://github.com/pandas-dev/pandas/issues/13032, but I cannot get it to work
- Parameters
df (DataFrame) –
- Returns
formatters – Keyed with df-keys and valued with lambdas s.t. formatters[key](istr) = formatted_istr
- Return type
dict
-
mdciao.utils.str_and_dict.fnmatch_ex(patterns_as_csv, list_of_keys)
Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html).
This method also allows for exclusions (like grep -v).
TODO: find out if regular expression re.findall() is better
Uses fnmatch under the hood
- Parameters
patterns_as_csv (str) – Patterns to include or exclude, separated by commas, e.g.
* "H*,-H8" will include all TMs but not H8
* "G.S*" will include all beta-sheets
list_of_keys (list) – Keys against which to match the patterns, e.g.
* ["H1", "ICL1", "H2" … "ICL3", "H6", "H7", "H8"]
- Returns
matching_keys
- Return type
list
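The include/exclude matching described above can be sketched with the standard-library `fnmatch` module. The helper name `fnmatch_ex_sketch` is hypothetical; treating a leading "-" as the exclusion marker follows the "H*,-H8" example, and the use of the case-sensitive `fnmatchcase` is an assumption.

```python
from fnmatch import fnmatchcase


def fnmatch_ex_sketch(patterns_as_csv, list_of_keys):
    """Keep keys matching any inclusion pattern and no exclusion pattern.

    Hypothetical helper for illustration; patterns prefixed with "-"
    are treated as exclusions.
    """
    patterns = [pattern.strip() for pattern in patterns_as_csv.split(",")]
    include = [p for p in patterns if p and not p.startswith("-")]
    exclude = [p[1:] for p in patterns if p.startswith("-")]
    return [key for key in list_of_keys
            if any(fnmatchcase(key, p) for p in include)
            and not any(fnmatchcase(key, p) for p in exclude)]


keys = ["H1", "ICL1", "H2", "H8"]
matches = fnmatch_ex_sketch("H*,-H8", keys)
```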
-
mdciao.utils.str_and_dict.freq_ascii2dict(ifile, comment=['#'])
Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column). Columns are separated by tabs or spaces.
Contact labels have to come after the frequency in the form of "res1 res2", "res1-res2" or "res1 - res2".
Columns other than the frequencies and the residue labels are ignored.
Examples
File produced by mdciao:
>>> #freq label residue idxs sum
>>> 0.59 R389@G.H5.21 - L394@G.H5.26 348 353 0.59
>>> 0.46 L394@G.H5.26 - K270@6.32x32 353 972 1.05
>>> 0.34 L388@G.H5.20 - L394@G.H5.26 347 353 1.39
>>> 0.32 L394@G.H5.26 - L230@5.69x69 353 957 1.71
>>> 0.04 R385@G.H5.17 - L394@G.H5.26 344 353 1.75
Minimal file with mixed labeling:
>>> 1 ALA30-GLU50
>>> .5 ASP31 - GLU51
>>> .1 ASP31 GLU50
TODO use pandas to allow more flex, not needed for the moment
- Parameters
ifile (str) – The filename to be read
comment (list of chars) – Any line starting with any of these characters will be ignored
- Returns
freqdict – Keys are “res1-res2” (regardless of input) and values are freqs
- Return type
dictionary
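To make the three accepted label forms concrete, here is a rough sketch of a line parser that operates on a list of strings instead of a file. The helper name `parse_freq_lines_sketch` is hypothetical, and the sketch assumes that a single residue token never contains "-".

```python
def parse_freq_lines_sketch(lines, comment=("#",)):
    """Parse 'freq label' lines into a {"res1-res2": freq} dictionary.

    Hypothetical parser for illustration; it accepts "res1 res2",
    "res1-res2" and "res1 - res2" and ignores any trailing columns.
    """
    freqdict = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(tuple(comment)):
            continue
        tokens = line.split()
        freq, rest = float(tokens[0]), tokens[1:]
        if "-" in rest[0]:       # "res1-res2"
            label = rest[0]
        elif rest[1] == "-":     # "res1 - res2"
            label = f"{rest[0]}-{rest[2]}"
        else:                    # "res1 res2"
            label = f"{rest[0]}-{rest[1]}"
        freqdict[label] = freq
    return freqdict


parsed = parse_freq_lines_sketch(
    ["#freq label", "1 ALA30-GLU50", ".5 ASP31 - GLU51", ".1 ASP31 GLU50"])
```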
-
mdciao.utils.str_and_dict.freq_file2dict(ifile, defrag=None)
Read a file containing the frequencies ("freq") and labels ("label") of pre-computed contacts
- Parameters
ifile (str) – Path to the file, which can be a .xlsx, .dat or .txt
defrag (str, default is None) – If passed a string, e.g. "@", the fragment information of the contact label will be deleted upon reading, so that R131@frag1 becomes R131. This is done by calling defrag_key internally
- Returns
Dictionary keyed by labels and valued with frequencies, e.g. {"0-1":.3, "0-2":.1}
- Return type
dict
-
mdciao.utils.str_and_dict.get_sorted_trajectories(trajectories)
Common parser for something that can be interpreted as a trajectory
- Parameters
trajectories – Can be one of these things:
- a pattern, e.g. "*.ext"
- one string containing a filename
- a list of filenames
- one mdtraj.Trajectory object
- a list of mdtraj.Trajectory objects
- Returns
- for an input pattern, sorted trajectory filenames that match that pattern
- for a filename, one list containing that filename
- for a list of filenames, a sorted list of filenames
- for one mdtraj.Trajectory object, a list containing that object
- for a list of mdtraj.Trajectory objects, the same list (i.e. does nothing)
-
mdciao.utils.str_and_dict.inform_about_trajectories(trajectories, only_show_first_and_last=False)
Return a string that informs about the trajectories
- Parameters
trajectories (list of strings or mdtraj.Trajectory objects) –
- Returns
listed_str – A string with the trajectory names separated by newlines
- Return type
str
-
mdciao.utils.str_and_dict.intblocks_in_str(istr)
Return the integers that appear as contiguous blocks in strings
E.g. “GLU30@3.50-GDP396@frag1” returns [30,3,50,396,1]
- Parameters
istr (string) –
- Returns
ints
- Return type
list
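The example above can be reproduced with a short regular-expression sketch. The helper name `intblocks_sketch` is hypothetical, and this is not necessarily how mdciao implements it.

```python
import re


def intblocks_sketch(istr):
    """Return every contiguous run of digits in istr as an int."""
    return [int(block) for block in re.findall(r"\d+", istr)]


ints = intblocks_sketch("GLU30@3.50-GDP396@frag1")
```

For the label above, `ints` is `[30, 3, 50, 396, 1]`: the dot in "3.50" splits the digits into two blocks.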
-
mdciao.utils.str_and_dict.iterate_and_inform_lambdas(ixtc, chunksize, stride=1, top=None)
Given a trajectory (as object or file), returns a strided, chunked iterator and a function for progress reporting
- Parameters
ixtc (str (filename) or mdtraj.Trajectory object) –
chunksize (int) – The trajectory will be iterated over in chunks of this many frames
stride (int, default is 1) – The stride with which to iterate over the trajectory
top (str (filename) or mdtraj.Topology) – If ixtc is a filename, the topology needed to read it
- Returns
iterate, inform
iterate (lambda(ixtc)) – Strided, chunked iterator over ixtc
inform (lambda(ixtc, traj_idx, chunk_idx, running_f)) – Function that prints out streaming progress for every iteration
Note
The lambdas returned differ depending on the type of input, but their signature is the same, so that the user does not have to care in subsequent use
-
mdciao.utils.str_and_dict.latex_mathmode(istr, enclose=True)
Prepend symbol words with "\" and protect non-symbol words with '\mathrm{}'
Symbol words are things that can be interpreted by LaTeX in math mode, e.g. '\alpha' or '\AA'
Non-symbol words are everything else
Works "opposite" to replace4latex and, for the moment, it's my (very bad) solution for latexifying contact-labels' fragments as super indices where the fragments themselves contain sub-indices (GLU30^{beta_2AR})
>>> replace4latex("There's an alpha and a beta here, also C_200")
"There's an $\alpha$ and a $\beta$ here, also $C_{200}$"
>>> latex_mathmode("There's an alpha and a beta here, also C_200")
"$\\mathrm{There's an }\\alpha\\mathrm{ and a }\\beta\\mathrm{ here, also C_200}$"
- Parameters
istr (string) –
enclose (bool, default is True) – Return the string enclosed in dollar signs: '$string$'. Use False for cases where the LaTeX math-mode is already active
- Returns
istr
- Return type
string
-
mdciao.utils.str_and_dict.latex_superscript_fragments(contact_label, defrag='@')
Format fragment descriptors as LaTeX math-mode superscripts
Thin wrapper around _latex_superscript_one_fragment with splitlabel
- Parameters
contact_label (str) – Contact label of any form, as long as two AAs are joined with the '-' character
defrag (char, default is '@') – The character that divides residue and fragment label
- Returns
contact_label
- Return type
str
-
mdciao.utils.str_and_dict.lexsort_ctc_labels(ctc_labels, reverse=False, columns=[0, 1], sep='-') → tuple
Sort contact-labels in ascending order of resSeq using both columns
Wraps around numpy.lexsort with some string handling.
It will also work with contact-labels consisting of only one residue, e.g. in the cases where the "anchor" has been deleted or the frequencies have been aggregated to per-residue frequencies
>>> labels = ["ALA30@3.50-GLU50",
...           "HIS28-GLU50",
...           "ALA30-GLU20"]
>>> sorted_ctc_labels, order = mdciao.utils.str_and_dict.lexsort_ctc_labels(labels)
>>> sorted_ctc_labels
['HIS28-GLU50',
 'ALA30-GLU20',
 'ALA30@3.50-GLU50']
- Parameters
ctc_labels (list or np.ndarray) – Strings describing the contact residues. It can also contain fragment information, which will be ignored when sorting but returned in sorted_ctc_labels
reverse (bool, default is False) – If True, sort in descending order, instead of ascending
columns (list) – The order of the columns, e.g. [0,1] means sort first by first column (idx 0), then by second column (idx 1).
sep (char, default is "-") – The character to use when separating the contact label into both residues
- Returns
order (1D np.ndarray) – The indices of ctc_labels that sort it into sorted_ctc_labels
sorted_ctc_labels (list) – The sorted contact labels
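The sorting rule can be illustrated without numpy by building a tuple key per label. This is a sketch under stated assumptions, not the library's code: the helper name `lexsort_labels_sketch` is hypothetical, the first run of digits before "@" is taken as the resSeq, fragment names are assumed not to contain the separator, and the return order (labels first, then indices) follows the usage example above.

```python
import re


def lexsort_labels_sketch(ctc_labels, reverse=False, columns=(0, 1), sep="-"):
    """Sort labels by the resSeq of the chosen columns.

    Hypothetical helper for illustration; it assumes that fragment names
    contain no `sep` and that every residue name contains an integer.
    """
    def res_seq(residue):
        # First run of digits in the residue part, ignoring the fragment
        return int(re.search(r"\d+", residue.split("@")[0]).group())

    def sort_key(idx):
        residues = ctc_labels[idx].split(sep)
        return tuple(res_seq(residues[col]) for col in columns if col < len(residues))

    order = sorted(range(len(ctc_labels)), key=sort_key, reverse=reverse)
    return [ctc_labels[idx] for idx in order], order


labels = ["ALA30@3.50-GLU50", "HIS28-GLU50", "ALA30-GLU20"]
sorted_labels, order = lexsort_labels_sketch(labels)
```

With these labels, the sort keys are (30, 50), (28, 50) and (30, 20), so the second column breaks the tie between the two ALA30 labels.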
-
mdciao.utils.str_and_dict.match_dict_by_patterns(patterns_as_csv, index_dict, verbose=False)
Joins all the values in an input dictionary if their key matches some patterns. This method also allows for exclusions (like grep -v).
TODO: find out if regular expression re.findall() is better
- Parameters
patterns_as_csv (str) – Comma-separated patterns to include or exclude, e.g.
* "H*,-H8" will include all TMs but not H8
* "G.S*" will include all beta-sheets
index_dict (dictionary) – It is expected to contain iterables of ints or floats or anything that is "joinable" via np.hstack. Typically, something like:
* {"H1":[0,1,…30], "ICL1":[31,32,…40],…}
- Returns
matching_keys, matching_values
- Return type
list, array of joined values
-
mdciao.utils.str_and_dict.replace4latex(istr, sindex=['_', '^'], symbols=['alpha', 'beta', 'gamma', 'sigma', 'mu', 'aa'], enclose_pure_text=False)
Return a string where symbols and super/sub-indices have been prepared for LaTeX
One quirk: when sub- or superindexing, the following types get protected in curly brackets to avoid only sub/super indexing the first character:
fully numeric: C_{300}
fully alphabetical: GLY_{ACE}
containing dots: L394^{G.H.26}
BUT mixed ones like beta_2AR are left unprotected:
>>> replace4latex("mdciao can alpha Sigma_2 beta_2AR ACE_GLY GLU30^3.50 no [frag1-WT] problem!")
'mdciao can $\\alpha$ $\\Sigma\\mathrm{_{2}}$ $\\beta\\mathrm{_2AR}$ $\\mathrm{{ACE}_{GLY}}$ $\\mathrm{GLU30^{3.50}}$ no [frag1-WT] problem!'
- Parameters
istr (str) – The string to be prepared for LaTeX math-mode. If a $ sign is already in istr, nothing will happen. If a word in istr contains the same sindex character more than once, it'll be skipped (ask [Knuth](https://tex.stackexchange.com/questions/253080/why-am-i-getting-a-double-subscript-error))
sindex (list) – The characters that indicate super- and sub-indices
symbols (list) – The words that should be considered LaTeX symbols
- Returns
lstr – The string with LaTeX math-mode insertions
- Return type
str
-
mdciao.utils.str_and_dict.replace_w_dict(input_str, exp_rep_dict)
Sequentially perform string replacements on a string using a dictionary
- Parameters
input_str (str) –
exp_rep_dict (dictionary) – Keys are expressions that will be replaced with values, i.e. key = key.replace(key1, val1) for key1, val1, etc.
- Returns
key
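Because the replacements are applied one after another, a later entry can act on the result of an earlier one. The sketch below shows this; the helper name `replace_w_dict_sketch` is hypothetical, and it relies on Python dictionaries preserving insertion order.

```python
def replace_w_dict_sketch(input_str, exp_rep_dict):
    """Apply each key -> value replacement in dictionary order."""
    for expression, replacement in exp_rep_dict.items():
        input_str = input_str.replace(expression, replacement)
    return input_str


relabeled = replace_w_dict_sketch("GLH30-HIE40", {"GLH": "GLU", "HIE": "HIS"})
# Sequential application: "A" first becomes "B", which then becomes "C"
chained = replace_w_dict_sketch("A", {"A": "B", "B": "C"})
```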
-
mdciao.utils.str_and_dict.sort_dict_by_asc_values(idict, reverse=False)
Sort a dictionary by values
- Parameters
idict (dict) – Input dictionary
reverse (bool, default is False) – Reverse the sorting order, i.e. sort by descending order of values
- Returns
odict – The input dictionary with its keys sorted by its values
- Return type
dict
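A minimal sketch of sorting a dictionary by its values, relying on the insertion order of Python dictionaries. The helper name `sort_dict_by_values_sketch` is hypothetical and not part of mdciao.

```python
def sort_dict_by_values_sketch(idict, reverse=False):
    """Return a new dict whose items are ordered by value."""
    return dict(sorted(idict.items(), key=lambda item: item[1], reverse=reverse))


freqs = {"A-B": 0.5, "C-D": 0.1, "E-F": 0.9}
ascending = sort_dict_by_values_sketch(freqs)
descending = sort_dict_by_values_sketch(freqs, reverse=True)
```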
-
mdciao.utils.str_and_dict.splitlabel(label, sep='-', defrag='@')
Split a contact label. Analogous to label.split(sep) but more robust because fragment names can contain the separator character.
- Parameters
label (str) – Can be of any of these forms:
- res1
- res1-res2
The fragment names can contain the separator, e.g. 'res1@B2AR-CT-res2@Gprot' is possible. Residue names cannot contain the separator.
The method assumes that labels start with a residue (see above), else you'll get weird behaviour.
sep (char, default is "-") – The character that separates pairs of labels
defrag (char, default is "@") – The character that separates residues from their host fragment
- Returns
split – A list equivalent to having used label.split(sep) but the separator is ignored in the fragment labels.
- Return type
list
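One possible heuristic for splitting such labels is sketched below. It is not taken from mdciao: the helper name `splitlabel_sketch` is hypothetical, and the heuristic cannot tell a single residue whose fragment contains the separator (e.g. "res1@B2AR-CT") from a pair, so it assumes pair labels in that case.

```python
def splitlabel_sketch(label, sep="-", defrag="@"):
    """Split a residue-pair label on `sep`, ignoring `sep` inside fragments.

    Hypothetical heuristic: residue names never contain `sep`, so the cut
    is placed relative to the positions of the `defrag` characters.
    """
    if defrag not in label:
        return label.split(sep)
    first_at = label.find(defrag)
    second_at = label.find(defrag, first_at + 1)
    first_sep = label.find(sep)
    if first_sep == -1:
        return [label]  # single residue, no separator anywhere
    if first_sep < first_at:
        # The first residue has no fragment: the first sep divides the pair
        cut = first_sep
    elif second_at != -1:
        # Both residues have fragments: cut at the last sep before the second defrag
        cut = label.rfind(sep, 0, second_at)
    else:
        # Only the first residue has a fragment: cut at the last sep
        cut = label.rfind(sep)
    return [label[:cut], label[cut + len(sep):]]


pair = splitlabel_sketch("res1@B2AR-CT-res2@Gprot")
```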
-
mdciao.utils.str_and_dict.sum_dict_per_residue(idict, sep)
Return a "per-residue" sum of values from a "per-residue-pair" keyed dictionary
Note: There is a closely related method in mdciao.contacts.ContactGroup that allows querying the freqs from the object already aggregated by residue. This function is for when the object is not accessible, e.g. because the freqs were loaded from a file.
- Parameters
idict (dict) – Keyed with contact labels like “res1@frag1-res2@3.50” etc
sep (char) – Character that separates fragments in the label
- Returns
aggr – keyed with “res1@frag1” etc
- Return type
dict
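The per-residue aggregation can be sketched as follows: each pair's value is added to both of its residues. The helper name `sum_per_residue_sketch` is hypothetical, and the plain `split` assumes that fragment names do not contain the separator.

```python
from collections import defaultdict


def sum_per_residue_sketch(idict, sep="-"):
    """Add each pair's value to both of its residues.

    Hypothetical helper for illustration; it assumes that fragment names
    do not contain `sep`.
    """
    aggr = defaultdict(float)
    for label, value in idict.items():
        for residue in label.split(sep):
            aggr[residue] += value
    return dict(aggr)


per_residue = sum_per_residue_sketch(
    {"res1@frag1-res2@3.50": 0.5, "res1@frag1-res3@frag2": 0.25})
```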
-
mdciao.utils.str_and_dict.unify_freq_dicts(freqs, exclude=None, key_separator='-', replacement_dict=None, defrag=None, per_residue=False, is_freq=True, val_missing=0, verbose=True)
Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.
Use key_separator for "GLU30-LYS40" == "LYS40-GLU30" to be True
- Parameters
freqs (dictionary of dictionaries) – e.g.:
{A:{key1:valA1, key2:valA2, key3:valA3},
B:{key2:valB2, key3:valB3}}
key_separator (str, default is "-") – Specify how residues are separated in the contact label, e.g. "GLU30-LYS40". With this knowledge, the method can split the label before comparison so that "GLU30-LYS40" is considered equal to "LYS40-GLU30". Use "", "none" or None to differentiate. It will also be passed to defrag_key in case defrag is not None.
exclude (list, default is None) – Keys containing these strings will be excluded. NOTE: This is not implemented yet, will raise an error
replacement_dict (dict, default is None) – All keys/strings will be subjected to replacements following this dictionary, so that "GLH30" is "GLU30" if replacement_dict is {"GLH":"GLU"}. This way, mutations and/or indexing can be accounted for in different setups
defrag (char, default is None) – If a char is given, e.g. "@", anything after that character in the labels will be considered fragment information and ignored. This is only recommended for advanced users; usually the fragment information helps keep track of residue names in complex topologies: R201@frag1 and R201@frag3 will both be "R201"
per_residue (bool, default is False) – Aggregate interactions to their residues
is_freq (bool, default is True) – Whether the dictionaries actually contain frequencies. If not, some checks are omitted
val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)
verbose (bool, default is True) – Be verbose
- Returns
unified_dict – A dictionary of dictionaries sharing keys:
{A:{key1:valA1, key2:valA2, key3:valA3},
B:{key1:0, key2:valB2, key3:valB3}}
- Return type
dictionary
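The key unification itself can be illustrated with a small sketch. It covers only the sharing of keys and the treatment of "A-B" and "B-A" as equal; replacements, defragmentation, per-residue aggregation and the frequency checks are left out. The helper name `unify_sketch` is hypothetical, and using the alphabetically sorted residues as the canonical key is an assumption.

```python
def unify_sketch(freqs, key_separator="-", val_missing=0):
    """Give every sub-dictionary the same keys, treating "A-B" == "B-A".

    Hypothetical helper for illustration; the canonical form of a key is
    its residues sorted alphabetically (an assumption).
    """
    def canonical(key):
        return key_separator.join(sorted(key.split(key_separator)))

    normalized = {system: {canonical(key): value for key, value in sub.items()}
                  for system, sub in freqs.items()}
    all_keys = sorted({key for sub in normalized.values() for key in sub})
    return {system: {key: sub.get(key, val_missing) for key in all_keys}
            for system, sub in normalized.items()}


unified = unify_sketch({"A": {"GLU30-LYS40": 0.5, "ALA10-GLU30": 0.25},
                        "B": {"LYS40-GLU30": 0.75}})
```

In the result, system "B" gains the key "ALA10-GLU30" with the value `val_missing`, and its "LYS40-GLU30" entry is stored under "GLU30-LYS40".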