mdciao.utils.str_and_dict
Functions for manipulating strings and dictionaries, also a bit of IO.
Functions
| average_freq_dict | Average frequencies (or anything) over dictionaries. |
| choose_options_descencing | Return the first entry that's acceptable according to some rule |
| defrag_key | Remove fragment information from a contact label |
| delete_exp_in_keys | Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. “GLU30-ARG40”, deletes the segment containing the input expression, exp |
| df_str_formatters | Return formatters for pandas.DataFrame.to_string |
| fnmatch_ex | Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching |
| freq_ascii2dict | Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column). |
| freq_file2dict | Read a file containing the frequencies ("freq") and labels ("label") of pre-computed contacts |
| get_trajectories_from_input | Common parser for something that can be interpreted as a trajectory |
| inform_about_trajectories | Return a string that informs about the trajectories |
| intblocks_in_str | Return the integers that appear as contiguous blocks in strings. |
| iterate_and_inform_lambdas | Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report |
| latex_mathmode | Prepend symbol words with "\ " and protect non-symbol words with '\mathrm{}' |
| latex_superscript_fragments | Format fragment descriptors as Latex math-mode superscripts |
| lexsort_ctc_labels | Sort contact-labels in ascending order of resSeq using both columns |
| match_dict_by_patterns | Joins all the values in an input dictionary if their key matches some patterns. |
| print_wrap | Print the text wrapping the lines to a given character width |
| replace4latex | Return a string where symbols and super/sub-indices have been prepared for LaTeX |
| replace_w_dict | Sequentially perform string replacements on a string using a dictionary |
| sort_dict_by_asc_values | Sort a dictionary by ascending values |
| splitlabel | Split a contact label. |
| sum_dict_per_residue | Return a "per-residue" sum of values from a "per-residue-pair" keyed dictionary |
| unify_freq_dicts | Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally. |
Classes
| FilenameGenerator | Generate per project filenames when you need them |
- class mdciao.utils.str_and_dict.FilenameGenerator(output_desc, ctc_cutoff_Ang, output_dir, graphic_ext, table_ext, graphic_dpi, t_unit)
Generate per project filenames when you need them
This is a work in progress to consolidate all file naming in one place, so that all sanitizing and project-specific naming operations happen here and not in the CLI methods.
A named tuple would’ve been enough, but we need some methods for dynamic naming (e.g. per-residue or per-traj)
- mdciao.utils.str_and_dict.average_freq_dict(freqs, weights=None, **unify_freq_dicts_kwargs)
Average frequencies (or anything) over dictionaries.
Typically, the input freqs is keyed first by system, then by contact label, e.g. {“T300”:{“GDP-R201”:1.0},“T320”:{“GDP-R201”:.25}, “MUT”:{“GDP-L201”:25}}
The input data need not be unified; the method calls unify_freq_dicts internally. In the example above, you have to call it with the argument replacement_dict={“L201”:“R201”} so that it can account for that mutation when unifying.
- Parameters:
freqs (dict of dicts) – The dictionary containing the frequency dictionaries
weights (dict, default is None) – Relative weights of each dictionary
unify_freq_dicts_kwargs (optional keyword args for unify_freq_dicts) – As listed below
- Other Parameters:
key_separator (str, default is “-”) – Specify how residues are separated in the contact label, e.g. “GLU30-LYS40”. With this knowledge, the method can split the label before comparison so that “GLU30-LYS40” is considered equal to “LYS40-GLU30”. Use “”, “none” or None to differentiate. It will also be passed to defrag_key in case defrag is not None.
exclude (list, default is None) – Keys containing these strings will be excluded. NOTE: This is not implemented yet and will raise an error
replacement_dict (dict, default is None) – All keys/strings will be subjected to replacements following this dictionary, such that “GLH30” becomes “GLU30” if replacement_dict is {“GLH”:“GLU”}. This way, mutations and/or indexing can be accounted for in different setups
defrag (char, default is None) – If a char is given, e.g. “@”, anything after that character in the labels will be considered fragment information and ignored. This is only recommended for advanced users; usually, the fragment information helps keep track of residue names in complex topologies: R201@frag1 and R201@frag3 will both be “R201”
per_residue (bool, default is False) – Aggregate interactions to their residues
is_freq (bool, default is True) – If the dictionaries actually contain frequencies or not. If not, some checks are omitted
val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)
verbose (bool, default is True) – Be verbose
- Returns:
averaged_dict – an averaged dictionary keyed only with the
- Return type:
dict
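The averaging step described above can be sketched in plain Python. This is only an illustration, not mdciao's implementation: the helper name average_unified is hypothetical, the input is assumed to be already unified (all systems share the same keys), and dividing by the sum of the weights is an assumption about how relative weights are applied.

```python
def average_unified(freqs, weights=None):
    """Weighted average over already key-unified frequency dictionaries.

    Hypothetical helper for illustration; assumes every sub-dictionary
    of `freqs` carries exactly the same contact labels.
    """
    systems = list(freqs)
    if weights is None:
        # Assumption: no weights means every system counts equally
        weights = {system: 1.0 for system in systems}
    total = sum(weights[system] for system in systems)
    labels = freqs[systems[0]].keys()
    return {
        label: sum(freqs[system][label] * weights[system] for system in systems) / total
        for label in labels
    }


freqs = {"T300": {"GDP-R201": 1.0}, "T320": {"GDP-R201": 0.25}}
plain = average_unified(freqs)
weighted = average_unified(freqs, weights={"T300": 3, "T320": 1})
```

With equal weights the two values are simply averaged; with weights 3 and 1 the first system contributes three quarters of the result.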
- mdciao.utils.str_and_dict.choose_options_descencing(options, fmt='%s', dont_accept=['none', 'na'])
Return the first entry that’s acceptable according to some rule
If no acceptable entry is found, “” is returned.
- Parameters:
options (list)
fmt (str, default is “%s”) – You can specify a different format here. It only applies if something is returned
dont_accept (list) – Move down the list if the current item is one of these
- Returns:
best – Either the best entry in options or “” if no option was found
- Return type:
str
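A minimal sketch of the selection rule described above, under stated assumptions: the name first_acceptable is hypothetical, and comparing the lower-cased string form of each option against dont_accept is an assumption, not something the documentation confirms.

```python
def first_acceptable(options, fmt="%s", dont_accept=("none", "na")):
    """Return the first option whose string form is not in dont_accept.

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    for option in options:
        if str(option).lower() not in dont_accept:
            # The format is only applied when something is returned
            return fmt % str(option)
    return ""


best = first_acceptable([None, "NA", "3.50", "frag1"])
formatted = first_acceptable(["3.50"], fmt="@%s")
nothing = first_acceptable([None], fmt="@%s")
```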
- mdciao.utils.str_and_dict.defrag_key(key, defrag='@', sep='-')
Remove fragment information from a contact label
- Parameters:
key (str) – Contact label with some sort of pair information, e.g. R1@frag1-E2@frag2 -> R1-E2
defrag (char, default is “@”) – Character that indicates the beginning of the fragment
sep (char, default is “-”) – Character that indicates the separation between first and second residue of the pair
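The transformation R1@frag1-E2@frag2 -> R1-E2 can be sketched as follows. The helper strip_fragments is hypothetical and assumes that fragment names do not contain the separator; labels whose fragments do contain it would need a more careful split.

```python
def strip_fragments(key, defrag="@", sep="-"):
    """Drop everything after `defrag` in each residue of a contact label.

    Hypothetical sketch; assumes fragment names never contain `sep`.
    """
    residues = key.split(sep)
    return sep.join(residue.split(defrag)[0] for residue in residues)


pair = strip_fragments("R1@frag1-E2@frag2")
single = strip_fragments("R131@frag1")
```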
- mdciao.utils.str_and_dict.delete_exp_in_keys(idict, exp, sep='-')
Assuming the keys in the dictionary are formed by two segments joined by a separator, e.g. “GLU30-ARG40”, deletes the segment containing the input expression, exp.
Will fail if not all keys have the expression to be deleted
- Parameters:
idict (dictionary)
exp (str)
sep (str, default is “-”)
- Returns:
dict – dictionary with the same values but the keys lack the segment containing exp
dhk (list) – List with the deleted half-keys
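As a rough illustration of the behaviour described above, the following sketch deletes the matching segment from every key. The name delete_segment is hypothetical, and raising a ValueError when a key lacks the expression is an assumption; the documentation only states that the call will fail.

```python
def delete_segment(idict, exp, sep="-"):
    """Remove the key segment containing `exp` from every key of `idict`.

    Hypothetical sketch; the exception type is an assumption.
    """
    out, deleted = {}, []
    for key, value in idict.items():
        segments = key.split(sep)
        kept = [segment for segment in segments if exp not in segment]
        if len(kept) == len(segments):
            raise ValueError(f"Key {key!r} does not contain {exp!r}")
        deleted.extend(segment for segment in segments if exp in segment)
        out[sep.join(kept)] = value
    return out, deleted


new_dict, half_keys = delete_segment({"GLU30-ARG40": 0.5, "ARG40-LYS50": 0.2}, "ARG40")
```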
- mdciao.utils.str_and_dict.df_str_formatters(df)
Return formatters for pandas.DataFrame.to_string
In principle, this should be solved by https://github.com/pandas-dev/pandas/issues/13032, but I cannot get it to work
- Parameters:
df (DataFrame)
- Returns:
formatters – Keyed with df-keys and valued with lambdas such that formatters[key](istr) returns formatted_istr
- Return type:
dict
- mdciao.utils.str_and_dict.fnmatch_ex(patterns_as_csv, list_of_keys)
Match the keys in list_of_keys against some naming patterns using Unix filename pattern matching (see https://docs.python.org/3/library/fnmatch.html).
This method also allows for exclusions (similar to grep -v)
TODO: find out if regular expression re.findall() is better
Uses fnmatch under the hood
- Parameters:
patterns_as_csv (str) – Patterns to include or exclude, separated by commas, e.g.
* “H*,-H8” will include all TMs but not H8
* “G.S*” will include all beta-sheets
list_of_keys (list) – Keys against which to match the patterns, e.g.
* [“H1”,“ICL1”, “H2”…“ICL3”,“H6”, “H7”, “H8”]
- Returns:
matching_keys
- Return type:
list
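The include/exclude matching can be sketched with the standard-library fnmatch module. The helper match_with_exclusions is hypothetical and only mirrors the behaviour described above (a leading "-" marks an exclusion pattern); it uses fnmatchcase so that matching is case-sensitive on every platform, which is an assumption about the intended behaviour.

```python
from fnmatch import fnmatchcase


def match_with_exclusions(patterns_as_csv, list_of_keys):
    """Keep keys matching any include pattern and no exclude pattern.

    Hypothetical sketch; patterns prefixed with "-" are exclusions.
    """
    include, exclude = [], []
    for pattern in patterns_as_csv.split(","):
        pattern = pattern.strip()
        if pattern.startswith("-"):
            exclude.append(pattern[1:])
        elif pattern:
            include.append(pattern)
    return [
        key for key in list_of_keys
        if any(fnmatchcase(key, p) for p in include)
        and not any(fnmatchcase(key, p) for p in exclude)
    ]


keys = ["H1", "ICL1", "H2", "ICL3", "H6", "H7", "H8"]
helices = match_with_exclusions("H*,-H8", keys)
sheets = match_with_exclusions("G.S*", ["G.S1", "G.H5", "G.S2"])
```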
- mdciao.utils.str_and_dict.freq_ascii2dict(ifile, comment='#')
Reads an ASCII file that contains contact frequencies (1st column) and contact labels (2nd and/or 3rd column). Columns are separated by tabs or spaces.
Contact labels have to come after the frequency in the form of “res1 res2”, “res1-res2” or “res1 - res2”.
Columns other than the frequencies and the residue labels are ignored.
Examples
File produced by mdciao:
>>> #freq label residue idxs sum
>>> 0.59 R389@G.H5.21 - L394@G.H5.26 348 353 0.59
>>> 0.46 L394@G.H5.26 - K270@6.32x32 353 972 1.05
>>> 0.34 L388@G.H5.20 - L394@G.H5.26 347 353 1.39
>>> 0.32 L394@G.H5.26 - L230@5.69x69 353 957 1.71
>>> 0.04 R385@G.H5.17 - L394@G.H5.26 344 353 1.75
Minimal file with mixed labeling:
>>> 1 ALA30-GLU50
>>> .5 ASP31 - GLU51
>>> .1 ASP31 GLU50
TODO use pandas to allow more flex, not needed for the moment
- Parameters:
ifile (str) – The filename to be read
comment (str, default is ‘#’) – Any line starting with any of these characters will be ignored
- Returns:
freqdict – Keys are “res1-res2” (regardless of input) and values are freqs
- Return type:
dictionary
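To make the accepted label layouts concrete, here is a small sketch that parses lines like the ones in the examples above. It is not the library's parser: parse_freq_lines is a hypothetical name, it works on an iterable of lines instead of a filename, and it assumes residue names contain no "-" of their own.

```python
def parse_freq_lines(lines, comment="#"):
    """Parse 'freq label' lines into a {"res1-res2": freq} dictionary.

    Hypothetical sketch; accepts "res1 res2", "res1-res2" and "res1 - res2".
    """
    freqdict = {}
    for line in lines:
        line = line.strip()
        if not line or line[0] in comment:
            continue  # skip blank lines and comment lines
        # Collapse "res1 - res2" into "res1-res2" before splitting columns
        fields = line.replace(" - ", "-").split()
        freq, label = float(fields[0]), fields[1]
        if "-" not in label:
            # "res1 res2": the second residue sits in the next column
            label = f"{label}-{fields[2]}"
        freqdict[label] = freq
    return freqdict


minimal = parse_freq_lines(["#freq label", "1 ALA30-GLU50", ".5 ASP31 - GLU51", ".1 ASP31 GLU50"])
full = parse_freq_lines(["0.59 R389@G.H5.21 - L394@G.H5.26 348 353 0.59"])
```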
- mdciao.utils.str_and_dict.freq_file2dict(ifile, defrag=None)
Read a file containing the frequencies (“freq”) and labels (“label”) of pre-computed contacts
- Parameters:
ifile (str) – Path to file, can be a .xlsx, .dat, .txt
defrag (str, default is None) – If passed a string, e.g. “@”, the fragment information of the contact label will be deleted upon reading, so that R131@frag1 becomes R131. This is done by calling defrag_key internally
- Returns:
Keyed by labels and valued with frequencies, e.g. {“0-1”:.3, “0-2”:.1}
- Return type:
dict
- mdciao.utils.str_and_dict.get_trajectories_from_input(trajectories)
Common parser for something that can be interpreted as a trajectory
- Parameters:
trajectories (can be one of these things:) –
* pattern, e.g. “*.ext”
* one single string containing a filename
* one single mdtraj.Trajectory object
* one list containing
  * just filenames
  * just mdtraj.Trajectory objects
  * a mix of filenames and mdtraj.Trajectory objects
- Returns:
outtrajs – A list of trajectories. This list can be, depending on the input:
* for an input pattern: sorted trajectory filenames that match that pattern
* for a filename or an mdtraj.Trajectory: one list containing that filename or mdtraj.Trajectory object
* for a list: that same list (i.e. nothing happens)
- Return type:
list
- mdciao.utils.str_and_dict.inform_about_trajectories(trajectories, only_show_first_and_last=False)
Return a string that informs about the trajectories
- Parameters:
trajectories (list of strings or mdtraj.Trajectory objects)
- Returns:
listed_str – A string with the trajectory names separated by newlines
- Return type:
str
- mdciao.utils.str_and_dict.intblocks_in_str(istr)
Return the integers that appear as contiguous blocks in strings.
E.g. “GLU30@3.50-GDP396@frag1” returns [30,3,50,396,1]
Will raise a ValueError if istr doesn’t contain any integers
Related to, but not the same as, int_from_AA_code
- Parameters:
istr (string)
- Returns:
ints
- Return type:
list (a ValueError is raised if istr doesn’t have any integers in it)
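The extraction of contiguous integer blocks can be sketched with a regular expression. The helper int_blocks is a hypothetical stand-in, shown only to make the example above reproducible.

```python
import re


def int_blocks(istr):
    """Return every run of consecutive digits in `istr` as an int.

    Hypothetical sketch; raises ValueError if no digits are present.
    """
    ints = [int(block) for block in re.findall(r"\d+", istr)]
    if not ints:
        raise ValueError(f"The string {istr!r} does not contain any integers")
    return ints


blocks = int_blocks("GLU30@3.50-GDP396@frag1")
```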
- mdciao.utils.str_and_dict.iterate_and_inform_lambdas(ixtc, chunksize, stride=1, top=None, nchars_fname=None)
Given a trajectory (as object or file), returns a strided, chunked iterator and function for progress report
- Parameters:
ixtc (str (filename) or mdtraj.Trajectory object)
chunksize (int) – The trajectory will be iterated over in chunks of this many frames
stride (int, default is 1) – The stride with which to iterate over the trajectory
top (str (filename) or mdtraj.Topology) – If ixtc is a filename, the topology needed to read it
nchars_fname (int, default is None) – The number of characters for the filename field. By default it adjusts automatically, but it can be fixed here in case you want to use the same field width for many files.
- Returns:
iterate, inform
iterate (lambda(ixtc)) – strided, chunked iterator over ixtc
inform (lambda(ixtc, traj_idx, chunk_idx, running_f)) – iterator that returns a string informing on streaming progress for every iteration
Note
The lambdas returned differ depending on the type of input, but their signature is the same, so that the user does not have to care about the input type in later use
- mdciao.utils.str_and_dict.latex_mathmode(istr, enclose=True)
Prepend symbol words with “\ “ and protect non-symbol words with ‘\mathrm{}’
symbol words are things that can be interpreted by LaTeX in math mode, e.g. ‘\alpha’ or ‘\AA’
non-symbol words are everything else
Works “opposite” to replace4latex, and for the moment it’s my (very bad) solution for latexifying contact-labels’ fragments as super indices where the fragments themselves contain sub-indices (GLU30^$beta_2AR})
>>> replace4latex("There's an alpha and a beta here, also C_200")
"There's an $\alpha$ and a $\beta$ here, also $C_{200}$"
>>> latex_mathmode("There's an alpha and a beta here, also C_200")
"$\\mathrm{There's an }\\alpha\\mathrm{ and a }\\beta\\mathrm{ here, also C_200}$"
- Parameters:
istr (string)
enclose (bool, default is True) – Return string enclosed in dollar-signs: ‘$string$’ Use False for cases where the LaTeX math-mode is already active
- Returns:
istr
- Return type:
string
- mdciao.utils.str_and_dict.latex_superscript_fragments(contact_label, defrag='@')
Format fragment descriptors as Latex math-mode superscripts
Thin wrapper around _latex_superscript_one_fragment with splitlabel
- Parameters:
contact_label (str) – contact label of any form, as long as two AAs are joined with the ‘-’ character
defrag (char, default is ‘@’) – The character to divide residue and fragment label
- Returns:
contact_label
- Return type:
str
- mdciao.utils.str_and_dict.lexsort_ctc_labels(ctc_labels, reverse=False, columns=[0, 1], sep='-') -> tuple
Sort contact-labels in ascending order of resSeq using both columns
Wraps around numpy.lexsort with some string handling.
It will also work with contact-labels consisting of only one residue, e.g. in the cases where the “anchor” has been deleted or the frequencies have been aggregated to per-residue frequencies
>>> labels = ["ALA30@3.50-GLU50",
...           "HIS28-GLU50",
...           "ALA30-GLU20"]
>>> sorted_ctc_labels, order = mdciao.utils.str_and_dict.lexsort_ctc_labels(labels)
>>> sorted_ctc_labels
['HIS28-GLU50', 'ALA30-GLU20', 'ALA30@3.50-GLU50']
- Parameters:
ctc_labels (list or np.ndarray) – Strings describing the contact residues. They can also contain fragment information, which will be ignored when sorting but returned in sorted_ctc_labels
reverse (bool, default is False) – If True, sort in descending order, instead of ascending
columns (list) – The order of the columns, e.g. [0,1] means sort first by first column (idx 0), then by second column (idx 1).
sep (char, default is “-”) – The character to use when separating the contact label into both residues
- Returns:
sorted_ctc_labels (list) – The sorted contact labels
order (1D np.ndarray) – The indices of ctc_labels that sort it into sorted_ctc_labels
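The ordering in the example above can be reproduced with a pure-Python sketch. Note the swap: the documented function wraps numpy.lexsort, whereas this illustration uses the built-in sorted with tuple keys and returns the order as a plain list. The names sort_labels and res_seq are hypothetical, taking the first block of digits of the residue name as its resSeq is an assumption, and fragment names are assumed not to contain the separator.

```python
import re


def res_seq(residue, defrag="@"):
    """First block of digits in the residue name, ignoring the fragment."""
    name = residue.split(defrag)[0]
    return int(re.findall(r"\d+", name)[0])


def sort_labels(ctc_labels, reverse=False, columns=(0, 1), sep="-", defrag="@"):
    """Sort contact labels by the resSeq of the chosen columns.

    Hypothetical sketch; single-residue labels simply contribute fewer
    sorting columns.
    """
    def sort_key(idx):
        residues = ctc_labels[idx].split(sep)
        return tuple(res_seq(residues[col], defrag) for col in columns if col < len(residues))

    order = sorted(range(len(ctc_labels)), key=sort_key, reverse=reverse)
    return [ctc_labels[idx] for idx in order], order


labels = ["ALA30@3.50-GLU50", "HIS28-GLU50", "ALA30-GLU20"]
sorted_labels, order = sort_labels(labels)
```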
- mdciao.utils.str_and_dict.match_dict_by_patterns(patterns_as_csv, index_dict, verbose=False)
Joins all the values in an input dictionary if their key matches some patterns. This method also allows for exclusions (similar to grep -v)
TODO: find out if regular expression re.findall() is better
- Parameters:
patterns_as_csv (str) – Comma-separated patterns to include or exclude, e.g.
* “H*,-H8” will include all TMs but not H8
* “G.S*” will include all beta-sheets
index_dict (dictionary) – It is expected to contain iterables of ints or floats or anything that is “joinable” via np.hstack. Typically, something like:
* {“H1”:[0,1,…30], “ICL1”:[31,32,…40],…}
- Returns:
matching_keys, matching_values
- Return type:
list, array of joined values
- mdciao.utils.str_and_dict.print_wrap(text, width=100, just_return_string=False, **kwargs)
Print the text wrapping the lines to a given character width
- Parameters:
text (str) – The text to wrap
width (int, default is 100) – The maximum number of characters per line
just_return_string (bool, default is False) – Instead of printing, just return the string
kwargs (dict, optional) – Keyword arguments for print()
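A sketch of this behaviour using the standard-library textwrap module. Whether the documented function relies on textwrap is not stated here, so treat that as an assumption; wrap_and_print is a hypothetical name.

```python
import textwrap


def wrap_and_print(text, width=100, just_return_string=False, **kwargs):
    """Wrap `text` to `width` characters, then print it or return it.

    Hypothetical sketch; extra keyword arguments are handed to print().
    """
    wrapped = "\n".join(textwrap.wrap(text, width=width))
    if just_return_string:
        return wrapped
    print(wrapped, **kwargs)


wrapped = wrap_and_print("word " * 60, width=20, just_return_string=True)
```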
- mdciao.utils.str_and_dict.replace4latex(istr, sindex=['_', '^'], symbols=['alpha', 'beta', 'gamma', 'sigma', 'mu', 'aa'], enclose_pure_text=False)
Return a string where symbols and super/sub-indices have been prepared for LaTeX
One quirk: when sub- or superindexing, the following types get protected in curly brackets to avoid only sub/super indexing the first character:
fully numeric: C_{300}
fully alphabetical: GLY_{ACE}
containing dots: L394^{G.H.26}
BUT mixed beta_2AR are left unprotected:
>>> replace4latex("mdciao can alpha Sigma_2 beta2AR ACE_GLY GLU30^3.50 no [frag1-WT] problem!")
'mdciao can $\\alpha$ $\\Sigma\\mathrm{_{2}}$ $\\beta\\mathrm{_2AR}$ $\\mathrm{{ACE}_{GLY}}$ $\\mathrm{GLU30^{3.50}}$ no [frag1-WT] problem!'
- Parameters:
istr (str) – The string to be prepared for LaTeX math mode. If a $ sign is already in istr, nothing will happen. If a word in istr contains the same sindex character more than once, it’ll be skipped (ask [Knut](https://tex.stackexchange.com/questions/253080/why-am-i-getting-a-double-subscript-error))
sindex (list) – The characters that indicate super- and sub-indices
symbols (list) – The words that should be considered LaTeX symbols
- Returns:
lstr – The string with LaTeX math-mode insertions
- Return type:
str
- mdciao.utils.str_and_dict.replace_w_dict(input_str, exp_rep_dict)
Sequentially perform string replacements on a string using a dictionary
- Parameters:
input_str (str)
exp_rep_dict (dictionary) – keys are expressions that will be replaced with values, i.e. key = key.replace(key1, val1) for key1, val1 etc
- Returns:
key – The input string after all replacements
- Return type:
str
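The sequential replacement can be sketched as a simple loop over the dictionary. The name replace_sequentially is hypothetical. Because the replacements are applied one after another, a later entry can act on the output of an earlier one.

```python
def replace_sequentially(input_str, exp_rep_dict):
    """Apply str.replace for every key/value pair, in dictionary order.

    Hypothetical sketch of the behaviour described above.
    """
    out = input_str
    for expression, replacement in exp_rep_dict.items():
        out = out.replace(expression, replacement)
    return out


renamed = replace_sequentially("GLH30-LYS40", {"GLH": "GLU"})
chained = replace_sequentially("a", {"a": "b", "b": "c"})
```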
- mdciao.utils.str_and_dict.sort_dict_by_asc_values(idict, reverse=False)
Sort a dictionary by ascending values
- Parameters:
idict (dict) – Input dictionary
reverse (bool, default is False) – Reverse the sorting order, i.e. sort by descending order of values
- Returns:
odict – The input dictionary with its keys sorted by its values
- Return type:
dict
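Since Python dictionaries preserve insertion order, sorting by value amounts to rebuilding the dictionary from its sorted items. A minimal sketch, with sort_by_values as a hypothetical name:

```python
def sort_by_values(idict, reverse=False):
    """Return a new dict whose keys are ordered by ascending value."""
    return dict(sorted(idict.items(), key=lambda item: item[1], reverse=reverse))


ascending = sort_by_values({"a": 3, "b": 1, "c": 2})
descending = sort_by_values({"a": 3, "b": 1, "c": 2}, reverse=True)
```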
- mdciao.utils.str_and_dict.splitlabel(label, sep='-', defrag='@', dont_split=None)
Split a contact label. Analogous to label.split(sep) but more robust because fragment names can contain the separator character.
- Parameters:
label (str) – Can be any of these forms:
* res1
* res1-res2
The fragment names can contain the separator, e.g. ‘res1@B2AR-CT-res2@Gprot’ is possible. Residue names cannot contain the separator.
The method assumes that labels start with a residue (see above), else you’ll get weird behaviour.
sep (char, default is “-”) – The character that separates pairs of labels
defrag (char, default is “@”) – The character that separates residues from their host fragment
dont_split (list, default is None) – The strings in this list won’t be separated even if they contain the separator. If the user knows that residue names like the ion “Cl-” or the ligand “DRG-1” might come up, they can “protect” them from splitting via this list.
- Returns:
split – A list equivalent to having used label.split(sep) but the separator is ignored in the fragment labels.
- Return type:
list
- mdciao.utils.str_and_dict.sum_dict_per_residue(idict, sep)
Return a “per-residue” sum of values from a “per-residue-pair” keyed dictionary
Note: There is a closely related method in mdciao.contacts.ContactGroup that allows querying the freqs from the object already aggregated by residue. This function is for when the object is not accessible, e.g. because the freqs were loaded from a file
- Parameters:
idict (dict) – Keyed with contact labels like “res1@frag1-res2@3.50” etc
sep (char) – Character that separates the residues in the label
- Returns:
aggr – keyed with “res1@frag1” etc
- Return type:
dict
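The per-residue aggregation can be sketched as follows. The helper sum_per_residue is hypothetical, gives sep a default for convenience (the documented function requires it), and assumes fragment names do not contain the separator.

```python
from collections import defaultdict


def sum_per_residue(idict, sep="-"):
    """Add each pair's value to both residues that form the pair.

    Hypothetical sketch; residues keep their fragment information.
    """
    aggr = defaultdict(float)
    for label, value in idict.items():
        for residue in label.split(sep):
            aggr[residue] += value
    return dict(aggr)


aggr = sum_per_residue({"res1@frag1-res2@3.50": 0.5, "res1@frag1-res3": 0.25})
```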
- mdciao.utils.str_and_dict.unify_freq_dicts(freqs, exclude=None, key_separator='-', replacement_dict=None, defrag=None, per_residue=False, is_freq=True, val_missing=0, verbose=True)
Provided with a dictionary of dictionaries, returns an equivalent, key-unified dictionary where all sub-dictionaries share their keys, putting zeroes where keys were absent originally.
Use key_separator for “GLU30-LYS40” == “LYS40-GLU30” to be True
- Parameters:
freqs (dictionary of dictionaries, e.g.:) –
- {A:{key1:valA1, key2:valA2, key3:valA3},
B:{ key2:valB2, key3:valB3}}
key_separator (str, default is “-”) – Specify how residues are separated in the contact label, e.g. “GLU30-LYS40”. With this knowledge, the method can split the label before comparison so that “GLU30-LYS40” is considered equal to “LYS40-GLU30”. Use “”, “none” or None to differentiate. It will also be passed to defrag_key in case defrag is not None.
exclude (list, default is None) – Keys containing these strings will be excluded. NOTE: This is not implemented yet and will raise an error
replacement_dict (dict, default is None) – All keys/strings will be subjected to replacements following this dictionary, such that “GLH30” becomes “GLU30” if replacement_dict is {“GLH”:“GLU”}. This way, mutations and/or indexing can be accounted for in different setups
defrag (char, default is None) – If a char is given, e.g. “@”, anything after that character in the labels will be considered fragment information and ignored. This is only recommended for advanced users; usually, the fragment information helps keep track of residue names in complex topologies: R201@frag1 and R201@frag3 will both be “R201”
per_residue (bool, default is False) – Aggregate interactions to their residues
is_freq (bool, default is True) – If the dictionaries actually contain frequencies or not. If not, some checks are omitted
val_missing (anything, default is 0) – What value to assign to the missing keys (TODO check the name of this in pandas)
verbose (bool, default is True) – Be verbose
- Returns:
unified_dict – A dictionary of dictionaries sharing keys: {A:{key1:valA1, key2:valA2, key3:valA3},
B:{key1:0, key2:valB2, key3:valB3}}
- Return type:
dictionary
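The key unification itself can be illustrated with a short sketch. It covers only the order-insensitive comparison and the filling of missing keys; replacements, defragmentation and per-residue aggregation are left out. The name unify is hypothetical, and using the alphabetically sorted residue pair as the shared key is an assumption made for simplicity, not necessarily how the library picks the unified label.

```python
def unify(freqs, key_separator="-", val_missing=0):
    """Give all sub-dictionaries the same keys, filling gaps with val_missing.

    Hypothetical sketch; "A-B" and "B-A" are treated as the same contact
    unless key_separator is "", "none" or None.
    """
    def canonical(label):
        if key_separator in (None, "", "none"):
            return label
        return key_separator.join(sorted(label.split(key_separator)))

    normalized = {
        system: {canonical(label): value for label, value in idict.items()}
        for system, idict in freqs.items()
    }
    all_keys = []
    for idict in normalized.values():
        for label in idict:
            if label not in all_keys:
                all_keys.append(label)  # keep first-seen order
    return {
        system: {label: idict.get(label, val_missing) for label in all_keys}
        for system, idict in normalized.items()
    }


unified = unify({"A": {"GLU30-LYS40": 0.5, "ALA1-GLY2": 0.1},
                 "B": {"LYS40-GLU30": 0.3}})
```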