Interrogation classes
Once you have searched a Corpus object, you’ll want to be able to edit, visualise and store results. Remember that upon importing corpkit, any pandas.DataFrame or pandas.Series object is monkey-patched with save, edit and visualise methods.
Interrogation
class corpkit.interrogation.Interrogation(results=None, totals=None, query=None, concordance=None)
Bases: object
Stores results of a corpus interrogation, before or after editing. The main attribute, results, is a pandas object, which can be edited or plotted.
results = None – pandas DataFrame containing counts for each subcorpus
totals = None – pandas Series containing summed results
query = None – dict containing values that generated the result
concordance = None – pandas DataFrame containing concordance lines, if concordance lines were requested
edit(*args, **kwargs)
Manipulate results of interrogations.
There are a few overall kinds of edit, most of which can be combined into a single function call. Keep in mind that many are basic wrappers around pandas operations: if you’re comfortable with pandas syntax, it may at times be faster to use it directly.
Basic mathematical operations: First, you can do basic maths on results, optionally passing in some data to serve as the denominator. Very commonly, you’ll want to get relative frequencies:
Example:
>>> data = corpus.interrogate({W: r'^t'})
>>> rel = data.edit('%', SELF)
>>> rel.results
..     to   that    the  then  ...  toilet  tolerant  tolerate   ton
01  18.50  14.65  14.44  6.20  ...    0.00      0.00      0.11  0.00
02  24.10  14.34  13.73  8.80  ...    0.00      0.00      0.00  0.00
03  17.31  18.01   9.97  7.62  ...    0.00      0.00      0.00  0.00
For the operation, there are a number of possible values, each of which is to be passed in as a str:
+, -, /, *, %: self explanatory
k: calculate keywords
a: get distance metric
SELF is a very useful shorthand denominator. When used, all editing is performed on the data. The totals are then extracted from the edited data, and used as denominator. If this is not the desired behaviour, however, a more specific interrogation.results or interrogation.totals attribute can be used.
In the example above, SELF (or ‘self’) is equivalent to:
Example: >>> rel = data.edit('%', data.totals)
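The arithmetic behind the % operation can be sketched without corpkit. This is a minimal illustration, assuming (as the example output above suggests) that each subcorpus row is divided by its own total. The nested dict and the relative_frequencies helper are hypothetical stand-ins for a results DataFrame, not part of corpkit.

```python
def relative_frequencies(counts):
    """Convert each subcorpus row of raw counts into percentages of that row's total.

    `counts` maps subcorpus name -> {entry: count}. This layout is an assumption
    made for illustration; corpkit itself stores results in a pandas DataFrame.
    """
    result = {}
    for subcorpus, row in counts.items():
        total = sum(row.values())
        result[subcorpus] = {
            entry: (100.0 * n / total if total else 0.0)
            for entry, n in row.items()
        }
    return result


counts = {
    '01': {'to': 30, 'that': 10, 'the': 10},
    '02': {'to': 5, 'that': 15, 'the': 0},
}
rel = relative_frequencies(counts)
```

Each row of rel sums to 100, which is the behaviour you would expect when a row's own total serves as the denominator.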
Keeping and skipping data: There are four keyword arguments that can be used to keep or skip rows or columns in the data:
- just_entries
- just_subcorpora
- skip_entries
- skip_subcorpora
Each can accept different input types:
- str: treated as regular expression to match
- list of integers: indices to match
- list of strings: entries/subcorpora to match
Example:
>>> data.edit(just_entries=r'^fr',
...           skip_entries=['free','freedom'],
...           skip_subcorpora=r'[0-9]')
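To make the combined effect of these arguments concrete, here is a rough plain-Python sketch. It handles only a regex str for just_entries and skip_subcorpora and a list of strings for skip_entries. The use of re.search and the order in which the filters apply are assumptions, and keep_and_skip is a hypothetical helper rather than corpkit code.

```python
import re


def keep_and_skip(counts, just_entries=None, skip_entries=None, skip_subcorpora=None):
    """Filter a {subcorpus: {entry: count}} mapping.

    just_entries / skip_subcorpora: regex strings (matched with re.search, an assumption).
    skip_entries: list of exact entry names to drop.
    """
    out = {}
    for subcorpus, row in counts.items():
        if skip_subcorpora and re.search(skip_subcorpora, subcorpus):
            continue  # drop the whole subcorpus
        kept = {}
        for entry, n in row.items():
            if just_entries and not re.search(just_entries, entry):
                continue  # entry does not match the keep pattern
            if skip_entries and entry in skip_entries:
                continue  # entry is explicitly skipped
            kept[entry] = n
        out[subcorpus] = kept
    return out


counts = {
    '1987': {'free': 9, 'french': 1},
    'intro': {'free': 3, 'freedom': 2, 'french': 4, 'fruit': 6, 'table': 9},
}
kept = keep_and_skip(counts,
                     just_entries=r'^fr',
                     skip_entries=['free', 'freedom'],
                     skip_subcorpora=r'[0-9]')
```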
Merging data: There are also keyword arguments for merging entries and subcorpora:
- merge_entries
- merge_subcorpora
These take a dict, with the new name as key and the criteria as value. The criteria can be a str (regex) or wordlist.
Example:
>>> from dictionaries.wordlists import wordlists
>>> mer = {'Articles': ['the', 'an', 'a'], 'Modals': wordlists.modals}
>>> data.edit(merge_entries=mer)
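Merging amounts to summing the matched entries under the new name. The sketch below illustrates that idea for a single row of counts. merge_entries here is a hypothetical stand-alone function, and treating a str criterion with re.search is an assumption.

```python
import re


def merge_entries(row, merges):
    """Sum entries matching each criterion and store the sum under the new name.

    `row` maps entry -> count; `merges` maps new name -> regex str or list of entries.
    """
    merged = dict(row)
    for new_name, criteria in merges.items():
        if isinstance(criteria, str):
            matched = [e for e in merged if re.search(criteria, e)]
        else:
            matched = [e for e in merged if e in criteria]
        if matched:
            total = sum(merged.pop(e) for e in matched)
            merged[new_name] = merged.get(new_name, 0) + total
    return merged


row = {'the': 10, 'an': 2, 'a': 5, 'can': 3, 'must': 1, 'dog': 4}
merged = merge_entries(row, {'Articles': ['the', 'an', 'a'],
                             'Modals': r'^(can|must)$'})
```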
Sorting: The sort_by keyword argument takes a str, which represents the way the result columns should be ordered.
- increase: highest to lowest slope value
- decrease: lowest to highest slope value
- turbulent: most change in y axis values
- static: least change in y axis values
- total/most: largest number first
- infreq/least: smallest number first
- name: alphabetically
Example: >>> data.edit(sort_by='increase')
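The increase and decrease orderings depend on a slope for each column. As a hedged illustration, the slope can be taken as the least-squares gradient of a column's values against subcorpus position 0, 1, 2, and so on. The helpers below are hypothetical and only show the ordering logic; corpkit's own regression code may differ.

```python
def slope(values):
    """Least-squares slope of values against positions 0..n-1 (needs n >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / float(n)
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def sort_by_slope(columns, order='increase'):
    """Return column names ordered by slope: 'increase' puts the highest slope first."""
    return sorted(columns, key=lambda name: slope(columns[name]),
                  reverse=(order == 'increase'))


columns = {'rise': [1, 2, 3], 'fall': [9, 5, 1], 'flat': [4, 4, 4]}
increasing = sort_by_slope(columns, 'increase')
decreasing = sort_by_slope(columns, 'decrease')
```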
Editing text: Column labels, corresponding to individual interrogation results, can also be edited with replace_names.
Parameters: replace_names (str/list of tuples/dict) – Edit result names, then merge duplicate entries. If replace_names is a string, it is treated as a regex to delete from each name. If replace_names is a dict, the value is the regex, and the key is the replacement text. Using a list of tuples in the form (find, replacement) allows duplicate substitution values.
Example: >>> data.edit(replace_names={r'object': r'[di]obj'})
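The dict form can be read as "rename with a regex substitution, then add together any names that now collide". A minimal sketch of that reading, using only re.sub from the standard library; the replace_names function below is a hypothetical illustration, not corpkit's implementation.

```python
import re


def replace_names(row, replacements):
    """Rename entries and merge duplicates.

    `replacements` maps replacement text -> regex, following the dict form
    described above (key is the replacement, value is the pattern).
    """
    out = {}
    for name, n in row.items():
        for replacement, pattern in replacements.items():
            name = re.sub(pattern, replacement, name)
        out[name] = out.get(name, 0) + n  # merge entries that now share a name
    return out


row = {'dobj': 7, 'iobj': 2, 'nsubj': 5}
renamed = replace_names(row, {r'object': r'[di]obj'})
```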
Parameters: replace_subcorpus_names (str/list of tuples/dict) – Edit subcorpus names, then merge duplicates. The same as replace_names, but on the other axis.
Other options: There are many other miscellaneous options.
Parameters: - keep_stats (bool) – Keep/drop stats values from dataframe after sorting
- keep_top (int) – After sorting, remove all but the top keep_top results
- just_totals (bool) – Sum each column and work with sums
- threshold – When a results list is used as the second dataframe (the denominator), drop values occurring fewer than n times. If not keywording, you can use 'high' (denominator total / 2500), 'medium' (denominator total / 5000) or 'low' (denominator total / 10000). If keywording, there are smaller default thresholds
- span_subcorpora (tuple – (int, int2)) – If subcorpora are numerically named, span all from int to int2, inclusive
- projection (tuple – (subcorpus_name, n)) – Multiply results in subcorpus by n
- remove_above_p (bool) – Delete any result over p
- p (float) – Set the p value
- revert_year (bool) – When doing linear regression on years, turn annual subcorpora into 1, 2 ...
- print_info (bool) – Print information to the console showing what is being edited
- spelling (str – 'US'/'UK') – Convert/normalise spelling
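The named threshold levels are simple divisions of the denominator total. The sketch below spells out that arithmetic for the non-keywording case only. resolve_threshold and THRESHOLD_DIVISORS are hypothetical names, and passing a number through unchanged is an assumption.

```python
# Divisors taken from the threshold description above (non-keywording case).
THRESHOLD_DIVISORS = {'high': 2500, 'medium': 5000, 'low': 10000}


def resolve_threshold(threshold, denominator_total):
    """Turn 'high'/'medium'/'low' into a minimum count; pass numbers through."""
    if isinstance(threshold, str):
        return denominator_total / float(THRESHOLD_DIVISORS[threshold])
    return threshold


high = resolve_threshold('high', 1000000)
medium = resolve_threshold('medium', 1000000)
low = resolve_threshold('low', 1000000)
```

With a denominator total of one million, 'high' therefore drops anything occurring fewer than 400 times, while 'low' only drops entries below 100.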
Keywording options: If the operation is k, you’re calculating keywords. In this case, some other keyword arguments have an effect:
Parameters: - keyword_measure (str) – What measure to use to calculate keywords: 'll' (log-likelihood) or 'pd' (percentage difference)
- selfdrop (bool) – When keywording, try to remove target corpus from reference corpus
- calc_all (bool) – When keywording, calculate words that appear in either corpus
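For readers unfamiliar with the two measures, the sketch below gives their textbook forms: the two-corpus log-likelihood statistic and the percentage difference of normalised frequencies. Whether corpkit computes them in exactly this way (for example, how it signs or smooths the values) is not stated here, so treat the functions as illustrative assumptions.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood.

    a, b: counts of the word in the target and reference corpus.
    c, d: total sizes of the target and reference corpus.
    """
    e1 = c * (a + b) / float(c + d)  # expected count in target
    e2 = d * (a + b) / float(c + d)  # expected count in reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def percentage_difference(a, b, c, d):
    """Percentage difference of normalised frequencies (requires b > 0)."""
    nf_target = a / float(c)
    nf_reference = b / float(d)
    return (nf_target - nf_reference) * 100.0 / nf_reference


ll_score = log_likelihood(10, 5, 1000, 1000)
pd_score = percentage_difference(10, 5, 1000, 1000)
```

A word that is twice as frequent in the target as in an equally sized reference corpus gets a percentage difference of 100.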
Returns: corpkit.interrogation.Interrogation
visualise(title='', x_label=None, y_label=None, style='ggplot', figsize=(8, 4), save=False, legend_pos='best', reverse_legend='guess', num_to_plot=7, tex='try', colours='Accent', cumulative=False, pie_legend=True, rot=False, partial_pie=False, show_totals=False, transparent=False, output_format='png', interactive=False, black_and_white=False, show_p_val=False, indices=False, transpose=False, **kwargs)
Visualise corpus interrogations using matplotlib.
Example:
>>> data.visualise('An example plot', kind='bar', save=True)
<matplotlib figure>
Parameters: - title (str) – A title for the plot
- x_label (str) – A label for the x axis
- y_label (str) – A label for the y axis
- kind (str ('line'/'bar'/'barh'/'pie'/'area')) – The kind of chart to make
- style (str ('ggplot'/'bmh'/'fivethirtyeight'/'seaborn-talk'/etc)) – Visual theme of plot
- figsize (tuple (int, int)) – Size of plot
- save (bool/str) – If bool, save with title as name; if str, use str as name
- legend_pos (str ('upper right'/'outside right'/etc)) – Where to place legend
- reverse_legend (bool) – Reverse the order of the legend
- num_to_plot (int/'all') – How many columns to plot
- tex (bool) – Use TeX to draw plot text
- colours (str) – Colourmap for lines/bars/slices
- cumulative (bool) – Plot values cumulatively
- pie_legend (bool) – Show a legend for pie chart
- partial_pie (bool) – Allow plotting of pie slices only
- show_totals (str -- 'legend'/'plot'/'both') – Print sums in plot where possible
- transparent (bool) – Transparent .png background
- output_format (str -- 'png'/'pdf') – File format for saved image
- black_and_white (bool) – Create black and white line styles
- show_p_val (bool) – Attempt to print p values in legend if contained in df
- indices (bool) – To use when plotting “distance from root”
- stacked (str) – When making bar chart, stack bars on top of one another
- filled (str) – For area and bar charts, make every column sum to 100
- legend (bool) – Show a legend
- rot (int) – Rotate x axis ticks by rot degrees
- subplots (bool) – Plot each column separately
- layout (tuple -- (int, int)) – Grid shape to use when subplots is True
- interactive (list -- [1, 2, 3]) – Experimental interactive options
Returns: matplotlib figure
save(savename, savedir='saved_interrogations', **kwargs)
Save an interrogation as pickle to savedir.
Example:
>>> o = corpus.interrogate(W, 'any')
### create ./saved_interrogations/savename.p
>>> o.save('savename')
Parameters: - savename (str) – A name for the saved file
- savedir (str) – Relative path to directory in which to save file
- print_info (bool) – Show/hide stdout
Returns: None
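As a rough picture of what saving as a pickle involves, the sketch below writes an object to savedir/savename.p with the standard pickle module and reads it back. save_pickle is a hypothetical helper. Unlike the method documented above it returns the path, purely so the usage lines can reload the file.

```python
import os
import pickle
import tempfile


def save_pickle(obj, savename, savedir='saved_interrogations'):
    """Pickle obj to savedir/savename.p, creating savedir if needed."""
    if not os.path.isdir(savedir):
        os.makedirs(savedir)
    path = os.path.join(savedir, savename + '.p')
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)
    return path


# Usage: write into a temporary directory, then read the object back.
savedir = os.path.join(tempfile.mkdtemp(), 'saved_interrogations')
path = save_pickle({'to': 2227, 'that': 2026}, 'savename', savedir=savedir)
with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```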
quickview(n=25)
View top n results as painlessly as possible.
Example:
>>> data.quickview(n=5)
0: to (n=2227)
1: that (n=2026)
2: the (n=1302)
3: then (n=857)
4: think (n=676)
Parameters: n (int) – Show top n results
Returns: None
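The output shown above is a ranked list of totals. A small sketch of how such lines could be produced from a mapping of entry to total follows. quickview_lines is a hypothetical function, and the real method prints rather than returning a list.

```python
def quickview_lines(totals, n=25):
    """Return 'rank: entry (n=count)' strings for the n largest totals."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return ['%d: %s (n=%d)' % (i, entry, count)
            for i, (entry, count) in enumerate(ranked)]


totals = {'then': 857, 'to': 2227, 'think': 676, 'the': 1302,
          'that': 2026, 'this': 500}
lines = quickview_lines(totals, n=5)
```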
multiindex(indexnames=None)
Create a pandas.MultiIndex object from slash-separated results.
Example:
>>> data = corpus.interrogate({W: 'st$'}, show=[L, F])
>>> data.results
..  just/advmod  almost/advmod  last/amod
01           79             12          6
02          105              6          7
03           86             10          1
>>> data.multiindex().results
Lemma       just  almost  last  first    most
Function  advmod  advmod  amod   amod  advmod
0             79      12     6      2       3
1            105       6     7      1       3
2             86      10     1      3       0
Parameters: indexnames (list of strings) – Provide custom names for the new index, or leave blank to guess.
Returns: corpkit.interrogation.Interrogation, with pandas.MultiIndex as results attribute
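The core of this transformation is splitting each slash-separated column name into its parts. A minimal sketch of that step follows; split_names is a hypothetical helper. A list of tuples like the one it returns is the kind of input that pandas.MultiIndex.from_tuples accepts.

```python
def split_names(columns, sep='/'):
    """Split each slash-separated column name into a tuple of its parts."""
    return [tuple(name.split(sep)) for name in columns]


columns = ['just/advmod', 'almost/advmod', 'last/amod']
tuples = split_names(columns)
```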
topwords(datatype='n', n=10, df=False, sort=True, precision=2)
Show top n results in each corpus alongside absolute or relative frequencies.
Parameters: - datatype (str (n/k/%)) – Show abs/rel frequencies, or keyness
- n (int) – Number of results to show
- df (bool) – Return a DataFrame
- sort (bool) – Sort results, or show as is
- precision (int) – Float precision to show
Example:
>>> data.topwords(n=5)
1987          %  1988          %  1989          %  1990           %
health    25.70  health    15.25  health    19.64  credit      9.22
security   6.48  cancer    10.85  security   7.91  health      8.31
cancer     6.19  heart      6.31  cancer     6.55  downside    5.46
flight     4.45  breast     4.29  credit     4.08  inflation   3.37
safety     3.49  security   3.94  safety     3.26  cancer      3.12
Returns: None
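The table above pairs each subcorpus's most frequent entries with a percentage. The sketch below shows one way to derive such pairs from raw counts, assuming the percentage is the entry's share of its own subcorpus total. topwords here is a hypothetical stand-alone function that returns data instead of printing it.

```python
def topwords(counts, n=10, precision=2):
    """For each subcorpus, return the n most frequent entries with their percentage share."""
    out = {}
    for subcorpus, row in counts.items():
        total = float(sum(row.values()))
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:n]
        out[subcorpus] = [
            (entry, round(100.0 * count / total, precision) if total else 0.0)
            for entry, count in ranked
        ]
    return out


counts = {
    '1987': {'health': 50, 'security': 30, 'cancer': 20},
    '1988': {'health': 10, 'cancer': 30},
}
top = topwords(counts, n=2)
```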
Interrodict
class corpkit.interrogation.Interrodict(data)
Bases: collections.OrderedDict
A class for interrogations that do not fit in a single-indexed DataFrame.
Individual interrogations can be looked up via dict keys, indexes or attributes:
Example:
>>> out_data['WSJ'].results
>>> out_data.WSJ.results
>>> out_data[3].results
Methods for saving, editing, etc. are similar to corpkit.interrogation.Interrogation. Additional methods are available for collapsing into single (multiindexed) DataFrames.
edit(*args, **kwargs)
Edit each value with Interrogation.edit(). See Interrogation.edit() for possible arguments.
Returns: A corpkit.interrogation.Interrodict
multiindex(indexnames=None)
Create a pandas.MultiIndex version of results.
Example:
>>> d = corpora.interrogate({F: 'compound', GL: '^risk'}, show=L)
>>> d.keys()
['CHT', 'WAP', 'WSJ']
>>> d['CHT'].results
....  health  cancer  security  credit  flight  safety  heart
1987      87      25        28      13       7       6      4
1988      72      24        20      15       7       4      9
1989     137      61        23      10       5       5      6
>>> d.multiindex().results
...               health  cancer  credit  security  downside
Corpus Subcorpus
CHT    1987           87      25      13        28        20
       1988           72      24      15        20        12
       1989          137      61      10        23        10
WAP    1987           83      44       8        44        10
       1988           83      27      13        40         6
       1989           95      77      18        25        12
WSJ    1987           52      27      33         4        21
       1988           39      11      37         9        22
       1989           55      47      43         9        24
Returns: A corpkit.interrogation.Interrogation
save(savename, savedir='saved_interrogations', **kwargs)
Save an interrogation as pickle to savedir.
Parameters: - savename (str) – A name for the saved file
- savedir (str) – Relative path to directory in which to save file
- print_info (bool) – Show/hide stdout
Example:
>>> o = corpus.interrogate(W, 'any')
### create saved_interrogations/savename.p
>>> o.save('savename')
Returns: None
collapse(axis='y')
Collapse Interrodict on an axis or along interrogation name.
Parameters: axis (str: x/y/n) – Collapse along x, y or name axis
Example:
>>> d = corpora.interrogate({F: 'compound', GL: r'^risk'}, show=L)
>>> d.keys()
['CHT', 'WAP', 'WSJ']
>>> d['CHT'].results
....  health  cancer  security  credit  flight  safety  heart
1987      87      25        28      13       7       6      4
1988      72      24        20      15       7       4      9
1989     137      61        23      10       5       5      6
>>> d.collapse().results
...  health  cancer  credit  security
CHT    3174    1156     566       697
WAP    2799     933     582      1127
WSJ    1812     680    2009       537
>>> d.collapse(axis='x').results
...  1987  1988  1989
CHT   384   328   464
WAP   389   355   435
WSJ   428   410   473
>>> d.collapse(axis='key').results
...   health  cancer  credit  security
1987     282     127      65        93
1988     277     100      70       107
1989     379     253      83        91
Returns: A corpkit.interrogation.Interrogation
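The three collapse directions can be understood as summing a corpus × subcorpus × entry structure over one of its dimensions. The sketch below does this for nested dicts, using the axis names from the example ('y', 'x', 'key'). The data layout and the collapse function are hypothetical, intended only to show which dimension disappears in each case.

```python
from collections import defaultdict


def collapse(data, axis='y'):
    """Sum {corpus: {subcorpus: {entry: count}}} over one dimension.

    'y'   -> sum over subcorpora, giving corpus x entry
    'x'   -> sum over entries, giving corpus x subcorpus
    'key' -> sum over corpora, giving subcorpus x entry
    """
    if axis not in ('y', 'x', 'key'):
        raise ValueError("axis must be 'y', 'x' or 'key'")
    out = defaultdict(lambda: defaultdict(int))
    for corpus, table in data.items():
        for subcorpus, row in table.items():
            for entry, n in row.items():
                if axis == 'y':
                    out[corpus][entry] += n
                elif axis == 'x':
                    out[corpus][subcorpus] += n
                else:
                    out[subcorpus][entry] += n
    return {outer: dict(inner) for outer, inner in out.items()}


data = {
    'CHT': {'1987': {'health': 87, 'cancer': 25}, '1988': {'health': 72, 'cancer': 24}},
    'WSJ': {'1987': {'health': 52, 'cancer': 27}, '1988': {'health': 39, 'cancer': 11}},
}
by_entry = collapse(data, 'y')
by_year = collapse(data, 'x')
by_key = collapse(data, 'key')
```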
topwords(datatype='n', n=10, df=False, sort=True, precision=2)
Show top n results in each corpus alongside absolute or relative frequencies.
Parameters: - datatype (str (n/k/%)) – Show abs/rel frequencies, or keyness
- n (int) – Number of results to show
- df (bool) – Return a DataFrame
- sort (bool) – Sort results, or show as is
- precision (int) – Float precision to show
Example:
>>> data.topwords(n=5)
TBT           %  UST           %  WAP           %  WSJ            %
health    25.70  health    15.25  health    19.64  credit      9.22
security   6.48  cancer    10.85  security   7.91  health      8.31
cancer     6.19  heart      6.31  cancer     6.55  downside    5.46
flight     4.45  breast     4.29  credit     4.08  inflation   3.37
safety     3.49  security   3.94  safety     3.26  cancer      3.12
Returns: None
Concordance
class corpkit.interrogation.Concordance(data)
Bases: pandas.core.frame.DataFrame
A class for concordance lines, with methods for saving, formatting and editing.
format(kind='string', n=100, window=35, columns='all', **kwargs)
Print concordance lines nicely, to string, LaTeX or CSV.
Parameters: - kind (str) – Output format: string/latex/csv
- n (int/'all') – Print first n lines only
- window (int) – How many characters to show to left and right
- columns (list) – Which columns to show
Example:
>>> lines = corpus.concordance({T: r'/NN.?/ >># NP'}, show=L)
### show 25 characters either side, 4 lines, just text columns
>>> lines.format(window=25, n=4, columns=[L,M,R])
0 we 're in tucson , then up north to flagst
1 e 're in tucson , then up north to flagstaff , then we we
2 tucson , then up north to flagstaff , then we went through th
3 through the grand canyon area and then phoenix and i sp
Returns: None
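The window argument trims the context on either side of the match. A hedged sketch of that trimming for a single line follows: keep the last window characters of the left context and the first window characters of the right, then pad so matches line up. format_line is a hypothetical helper, and the exact padding corpkit uses may differ.

```python
def format_line(left, match, right, window=35):
    """Trim left/right context to `window` characters and align around the match.

    Assumes window > 0. The left context keeps its rightmost characters so the
    text nearest the match is preserved.
    """
    left = left[-window:].rjust(window)
    right = right[:window].ljust(window)
    return '%s  %s  %s' % (left, match, right)


trimmed = format_line('abcdef', 'X', 'uvwxyz', window=3)
padded = format_line('a', 'X', 'b', window=3)
```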
shuffle(inplace=False)
Shuffle concordance lines.
Parameters: inplace (bool) – Modify current object, or create a new one
Example:
>>> lines[:4].shuffle()
3 01 1-01.txt.xml through the grand canyon area and then phoenix and i sp
1 01 1-01.txt.xml e 're in tucson , then up north to flagstaff , then we we
0 01 1-01.txt.xml we 're in tucson , then up north to flagst
2 01 1-01.txt.xml tucson , then up north to flagstaff , then we went through th