Visualising results¶

One thing missing in a lot of corpus linguistic tools is the ability to produce high-quality visualisations of corpus data. corpkit uses the corpkit.interrogation.Interrogation.visualise method to do this.

Basics
Plot type
Plot style
Figure and font size
Title and labels
Subplots
TeX
Legend
Colours
Saving figures
Other options
Multiplotting

Note

Most of the keyword arguments from Pandas’ plot method are available. See their documentation for more information.

Basics ¶

visualise() is a method of all corpkit.interrogation.Interrogation objects. If you use from corpkit import *, it is also monkey-patched to Pandas objects.

Note

If you’re using a Jupyter Notebook, make sure you use %matplotlib inline or %matplotlib notebook to set the appropriate backend.

A common workflow is to interrogate a corpus, relativise the results, and visualise:

>>> from corpkit import *
>>> corpus = Corpus('data/P-parsed', load_saved=True)
>>> counts = corpus.interrogate({T: r'MD < __'})
>>> reldat = counts.edit('%', SELF)
>>> reldat.visualise('Modals', kind='line', num_to_plot=ALL).show()
### the visualise method can also attach to the df:
>>> reldat.results.visualise(...).show()
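
As a rough illustration of the relativising step, the sketch below turns absolute counts into percentages in plain Python. It assumes that edit('%', SELF) divides each count by the total for its own subcorpus; the relativise function and the sample counts are made up for this example and are not part of corpkit.

```python
def relativise(counts):
    """Convert {subcorpus: {entry: count}} into percentages of each subcorpus total."""
    relative = {}
    for subcorpus, entries in counts.items():
        total = sum(entries.values())
        relative[subcorpus] = {
            # guard against empty subcorpora to avoid dividing by zero
            entry: (100.0 * n / total if total else 0.0)
            for entry, n in entries.items()
        }
    return relative

# hypothetical modal counts for two subcorpora
counts = {'1990': {'would': 30, 'can': 50, 'may': 20},
          '2000': {'would': 10, 'can': 25, 'may': 15}}
reldat = relativise(counts)
```

Relative values like these are usually easier to compare across subcorpora of different sizes than raw counts.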

The current behaviour of visualise() is to return the pyplot module. This allows you to edit figures further before showing them. Therefore, there are two ways to show the figure:

>>> data.visualise().show()

>>> plt = data.visualise()
>>> plt.show()
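
Because the return value is the pyplot module, any pyplot function can be called on it before showing the figure. The sketch below imitates this with plain matplotlib; draw is a hypothetical stand-in for visualise(), not corpkit code.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

def draw(values):
    """Hypothetical stand-in for visualise(): draw a line, hand back pyplot."""
    plt.figure()
    plt.plot(values)
    return plt

p = draw([3, 5, 4])
# further edits act on the current figure
p.title('Edited after plotting')
p.axhline(4, linestyle='--')
title_text = p.gca().get_title()
p.close('all')
```

In an interactive session you would call p.show() instead of p.close('all').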

Plot type ¶

The visualise method allows line, bar, horizontal bar (barh), area, and pie charts. Those with seaborn installed can also use 'heatmap' (see the seaborn heatmap documentation). Just pass in the type as a string with the kind keyword argument. Heatmap-specific arguments such as robust=True can then be used as well.

>>> data.visualise(kind='heatmap', robust=True, figsize=(4,12),
...                x_label='Subcorpus', y_label='Event').show()

Heatmap example

Stacked area/line plots can be made with stacked=True. You can also use filled=True to attempt to make all values sum to 100. Cumulative plotting can be done with cumulative=True. Below is an area plot beside an area plot where filled=True. Both use the viridis colour scheme.
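
The arithmetic behind these two options can be sketched in plain Python. This is only an illustration under two assumptions: that filled=True rescales the values at each point on the x axis so they sum to 100, and that cumulative=True plots running totals along the x axis. The function names are invented for the example.

```python
from itertools import accumulate

def fill_to_100(values):
    """Rescale the values at one x position so that they sum to 100."""
    total = sum(values)
    if not total:
        return [0.0 for _ in values]
    return [100.0 * v / total for v in values]

def running_total(series):
    """Running totals for one entry across the x axis."""
    return list(accumulate(series))

filled = fill_to_100([1, 1, 2])
cumulative = running_total([3, 1, 4])
```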

Plot style ¶

You can select from a number of styles, such as ggplot, fivethirtyeight, bmh, and classic. If you have seaborn installed (and you should), then you can also select from seaborn styles (seaborn-paper, seaborn-dark, etc.).

Figure and font size ¶

You can pass in a tuple of (width, height) as figsize to control the size of the figure. You can also pass an integer as fontsize.

Title and labels ¶

You can label your plot with title, x_label and y_label:

>>> data.visualise('Modals', x_label='Subcorpus', y_label='Relative frequency')

Subplots ¶

subplots=True makes a separate plot for every entry in the data. If using it, you’ll probably also want to use layout=(rows,columns) to specify how you’d like the plots arranged.

>>> data.visualise(subplots=True, layout=(2,3)).show()

Line charts using subplots and layout specification
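
If you don't want to work out the grid by hand, a small helper can pick one. The sketch below is not part of corpkit; grid_for is a hypothetical function that returns a (rows, columns) tuple with at least as many cells as there are entries, which is what Pandas requires of a layout.

```python
import math

def grid_for(n_entries, columns=3):
    """Return (rows, columns) with enough cells for n_entries subplots."""
    if n_entries < 1:
        raise ValueError('need at least one entry to plot')
    columns = min(columns, n_entries)      # don't leave a row mostly empty
    rows = math.ceil(n_entries / columns)  # round up so every entry fits
    return rows, columns

layout_six = grid_for(6)
layout_seven = grid_for(7)
layout_two = grid_for(2)
```

The result could then be passed along as layout=grid_for(n) together with subplots=True.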

TeX ¶

If you have LaTeX installed, you can use tex=True to render text with LaTeX. By default, visualise() tries to use LaTeX if it can.

Legend ¶

You can turn the legend off with legend=False. Legend placement can be controlled with legend_pos, which can be:

Margin	Figure		Margin
outside upper left	upper left	upper right	outside upper right
outside center left	center left	center right	outside center right
outside lower left	lower left	lower right	outside lower right

The default value, 'best', tries to find the best place automatically (without leaving the figure boundaries).

If you pass in draggable=True, you should be able to drag the legend around the figure.
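
For the curious, here is one way the position names above could be translated into matplotlib legend arguments. It is a sketch, not corpkit's actual implementation: legend_kwargs and its pad parameter are hypothetical, while loc and bbox_to_anchor are the real matplotlib legend arguments. An 'outside' position is produced by anchoring the opposite edge of the legend just beyond the axes.

```python
def legend_kwargs(legend_pos='best', pad=0.02):
    """Map a position name onto keyword arguments for matplotlib's legend()."""
    prefix = 'outside '
    if not legend_pos.startswith(prefix):
        # 'best', 'upper left', 'center right', ... are valid loc values as-is
        return {'loc': legend_pos}
    vertical, side = legend_pos[len(prefix):].split()
    y = {'upper': 1.0, 'center': 0.5, 'lower': 0.0}[vertical]
    if side == 'right':
        # pin the legend's left edge slightly past the right edge of the axes
        return {'loc': vertical + ' left', 'bbox_to_anchor': (1.0 + pad, y)}
    # pin the legend's right edge slightly before the left edge of the axes
    return {'loc': vertical + ' right', 'bbox_to_anchor': (-pad, y)}

inside = legend_kwargs('lower right')
outside = legend_kwargs('outside upper right')
```

With a matplotlib axes object ax, these could be applied as ax.legend(**outside).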

Colours ¶

You can use the colours keyword argument to pass in:

A colour name recognised by matplotlib

A hex colour string

A colourmap object

There is an extra argument, black_and_white, which can be set to True to make greyscale plots. Unlike colours, it also updates line styles.
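
The three kinds of value can be checked or built with matplotlib itself before handing them to colours. The snippet below uses only plain matplotlib calls; how corpkit handles each value internally is not shown here.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
from matplotlib import colors

# a colour name and a hex string that matplotlib recognises
name_ok = colors.is_color_like('steelblue')
hex_ok = colors.is_color_like('#1f77b4')
red_hex = colors.to_hex('red')

# a colourmap object; calling it with a number in [0, 1] gives an RGBA tuple
cmap = plt.get_cmap('viridis')
samples = [cmap(i / 3) for i in range(4)]
```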

Saving figures ¶

To save a figure to a project’s images directory, you can use the save argument. Set output_format to 'png' or 'pdf' to change the file format.

>>> data.visualise(save='name', output_format='png')

Other options ¶

There are a number of further keyword arguments for customising figures:

Argument	Type	Action
grid	bool	Show grid in background
rot	int	Rotate x axis labels n degrees
shadow	bool	Shadows for some parts of plot
ncol	int	n columns for legend entries
explode	list	Explode these entries in pie
partial_pie	bool	Allow plotting of pie slices
legend_frame	bool	Show frame around legend
legend_alpha	float	Opacity of legend
reverse_legend	bool	Reverse legend entry order
transpose	bool	Flip axes of DataFrame
logx/logy	bool	Log scales
show_p_val	bool	Try to show p value in legend
interactive	bool	Experimental mpld3 use

A number of these and other options for customising figures are also described in the corpkit.interrogation.Interrogation.visualise method documentation.
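
Several of these arguments, such as grid, rot and logy, are ordinary Pandas plotting options. The sketch below shows them on a plain DataFrame with made-up counts, without corpkit; whether corpkit forwards each one unchanged is an assumption.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs without a display
import pandas as pd

# hypothetical counts: rows are subcorpora, columns are entries
df = pd.DataFrame({'would': [30, 10, 5], 'can': [50, 25, 40]},
                  index=['1990', '2000', '2010'])

ax = df.plot(kind='line',
             title='Modals',
             grid=True,       # show grid in background
             rot=45,          # rotate x axis labels by 45 degrees
             logy=True,       # log scale on the y axis
             figsize=(6, 4))
```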

Multiplotting ¶

The corpkit.interrogation.Interrogation class also comes with a corpkit.interrogation.Interrogation.multiplot method, which can be used to show two different kinds of chart within the same figure.

The first two arguments for the method are two dict objects, which configure the larger and smaller plots respectively.

For the second dictionary, you may pass in a data argument, which is a corpkit.interrogation.Interrogation or similar, and will be used as separate data for the subplots. This is useful, for example, if you want your main plot to show absolute frequencies, and your subplots to show relative frequencies.

There is also layout, which you can use to choose an overall grid design. You can also pass in a list of tuples if you like, to use your own layout. Below is a complete example, focussing on objects in risk processes:

>>> from corpkit import *
>>> from corpkit.dictionaries import *
### parse a collection of text files
>>> corpus = Corpus('data/news')
### make dependency parse query: get 'object' of risk process
>>> query = {F: roles.participant2, GL: r'\brisk', GF: roles.process}
### interrogate corpus, return lemma form, no coreference
>>> result = corpus.interrogate(query, show=[L], coref=False)
### generate relative frequencies, skip closed class, and sort
>>> inc = result.edit('%', SELF,
...                   sort_by='increase',
...                   skip_entries=wordlists.closedclass)
### visualise as area and line charts combined
>>> inc.multiplot({'title': 'Objects of risk processes, increasing',
...                'kind': 'area',
...                'x_label': 'Year',
...                'y_label': 'Percentage of all results'},
...               {'kind': 'line'}, layout=5)

multiplot example
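
To give a feel for the kind of figure this produces, here is a rough analogue in plain matplotlib: one large area chart above a row of small line charts. It is only a sketch with invented numbers and a hand-built grid, and it does not reproduce corpkit's layout presets.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013]
# hypothetical relative frequencies for three entries
data = {'life': [10, 12, 15, 18],
        'health': [8, 9, 9, 11],
        'money': [5, 7, 6, 9]}

fig = plt.figure(figsize=(8, 6))
# large plot spanning the whole top row of a 2 x 3 grid
big = plt.subplot2grid((2, 3), (0, 0), colspan=3)
big.stackplot(years, *data.values(), labels=list(data))
big.set_title('Objects of risk processes, increasing')
big.legend(loc='upper left')

# one small line chart per entry along the bottom row
for column, (name, values) in enumerate(data.items()):
    small = plt.subplot2grid((2, 3), (1, column))
    small.plot(years, values)
    small.set_title(name)

n_axes = len(fig.axes)
plt.close(fig)
```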