Interrogating corpora¶
Once you’ve built a corpus, you can search it for linguistic phenomena. This is done with the interrogate()
method.
Introduction¶
Interrogations can be performed on any corpkit.corpus.Corpus
object, but also, on corpkit.corpus.Subcorpus
objects, corpkit.corpus.File
objects and corpkit.corpus.Datalist
objects (slices of Corpus
objects). You can search plaintext corpora, tokenised corpora or fully parsed corpora using the same method. We’ll focus on parsed corpora in this guide.
>>> from corpkit import *
### words matching 'woman', 'women', 'man', 'men'
>>> query = {W: r'/(^wo)m.n/'}
### interrogate corpus
>>> corpus.interrogate(query)
### interrogate parts of corpus
>>> corpus[2:4].interrogate(query)
>>> corpus.files[:10].interrogate(query)
### if you have a subcorpus called 'abstract':
>>> corpus.subcorpora.abstract.interrogate(query)
Corpus interrogations will output a corpkit.interrogation.Interrogation
object, which stores a DataFrame of results, a Series of totals, a dict
of values used in the query, and, optionally, a set of concordance lines. Let’s search for proper nouns in The Great Gatsby and see what we get:
>>> corp = Corpus('gatsby-parsed')
### turn on concordancing:
>>> propnoun = corp.interrogate({P: '^NNP'}, do_concordancing=True)
>>> propnoun.results
gatsby tom daisy mr. wilson jordan new baker york miss
chapter1 12 32 29 4 0 2 10 21 6 19
chapter2 1 30 6 8 26 0 6 0 6 0
chapter3 28 0 1 8 0 22 5 6 5 1
chapter4 38 10 15 25 1 9 5 8 4 7
chapter5 36 3 26 4 0 0 1 1 1 1
chapter6 37 21 19 11 0 1 4 0 3 4
chapter7 63 87 60 9 27 35 9 2 5 1
chapter8 21 3 19 1 19 1 0 1 0 0
chapter9 27 5 9 14 4 3 4 1 4 1
>>> propnoun.totals
chapter1 232
chapter2 252
chapter3 171
chapter4 428
chapter5 128
chapter6 219
chapter7 438
chapter8 139
chapter9 208
dtype: int64
>>> propnoun.query
{'case_sensitive': False,
'corpus': 'gatsby-parsed',
'dep_type': 'collapsed-ccprocessed-dependencies',
'do_concordancing': True,
'exclude': False,
'excludemode': 'any',
'files_as_subcorpora': True,
'gramsize': 1,
...}
>>> propnoun.concordance # (sample)
54 chapter1 They had spent a year in france for no particular reason and then d
55 chapter1 n't believe it I had no sight into daisy 's heart but i felt that tom would
56 chapter1 into Daisy 's heart but I felt that tom would drift on forever seeking a li
57 chapter1 This was a permanent move said daisy over the telephone but i did n't be
58 chapter1 windy evening I drove over to East egg to see two old friends whom i scarc
59 chapter1 warm windy evening I drove over to east egg to see two old friends whom i s
60 chapter1 d a cheerful red and white Georgian colonial mansion overlooking the bay
61 chapter1 pen to the warm windy afternoon and tom buchanan in riding clothes was stan
62 chapter1 to the warm windy afternoon and Tom buchanan in riding clothes was standing with
Cool, eh? We’ll focus on what to do with these attributes later. Right now, we need to learn how to generate them.
Search types¶
Parsed corpora contain many different kinds of things we might like to search. There are word forms, lemma forms, POS tags, word classes, indices, and constituency and (three different) dependency grammar annotations. For this reason, the search query is a dict
object passed to the interrogate()
method, whose keys specify what to search, and whose values specify a query. The simplest ones are given in the table below.
Note
Single capital letter variables in code examples represent lowercase strings (W = 'w')
. These variables are made available by doing from corpkit import *
. They are used here for readability.
Search | Gloss |
---|---|
W | Word |
L | Lemma |
F | Function |
P | POS tag |
X | Word class |
E | NER tag |
A | Distance from root |
I | Index in sentence |
S | Sentence index |
R | Coref representative |
Because it comes first, and because it’s always needed, you can pass it in like an argument, rather than a keyword argument.
### get variants of the verb 'be'
>>> corpus.interrogate({L: 'be'})
### get words in 'nsubj' position
>>> corpus.interrogate({F: 'nsubj'})
Multiple key/value pairs can be supplied. By default, all must match for the result to be counted, though this can be changed with searchmode=ANY
or searchmode=ALL
:
>>> goverb = {P: r'^v', L: r'^go'}
### get all variants of 'go' as verb
>>> corpus.interrogate(goverb, searchmode=ALL)
### get all verbs and any word starting with 'go':
>>> corpus.interrogate(goverb, searchmode=ANY)
Grammatical searching¶
In the examples above, we match attributes of tokens. The great thing about parsed data, is that we can search for relationships between words. So, other possible search keys are:
Search | Gloss |
---|---|
G | Governor |
D | Dependent |
H | Coreference head |
T | Syntax tree |
A1 | Token 1 place to left |
Z1 | Token 1 place to right |
>>> q = {G: r'^b'}
### return any token with governor word starting with 'b'
>>> corpus.interrogate(q)
Governor, Dependent and Left/Right can be combined with the earlier table, allowing a large array of search types:
Match | Governor | Dependent | Coref head | Left/right | |
---|---|---|---|---|---|
Word | W | G | D | H | A1/Z1 |
Lemma | L | GL | DL | HL | A1L/Z1L |
Function | F | GF | DF | HF | A1F/Z1F |
POS tag | P | GP | DP | HP | A1P/Z1P |
Word class | X | GX | DX | HX | A1X/Z1X |
Distance from root | A | GA | DA | HA | A1A/Z1A |
Index | I | GI | DI | HI | A1I/Z1I |
Sentence index | S | GS | DS | HS | A1S/Z1S |
Syntax tree searching can’t be combined with other options. We’ll return to them in a minute, however.
Excluding results¶
You may also wish to exclude particular phenomena from the results. The exclude
argument takes a dict
in the same form a search
. By default, if any key/value pair in the exclude
argument matches, it will be excluded. This is controlled by excludemode=ANY
or excludemode=ALL
.
>>> from corpkit.dictionaries import wordlists
### get any noun, but exclude closed class words
>>> corpus.interrogate({P: r'^n'}, exclude={W: wordlists.closedclass})
### when there's only one search criterion, you can also write:
>>> corpus.interrogate(P, r'^n', exclude={W: wordlists.closedclass})
In many cases, rather than using exclude
, you could also remove results later, during editing.
What to show¶
Up till now, all searches have simply returned words. The final major argument of the interrogate
method is show
, which dictates what is returned from a search. Words are the default value. You can use any of the search values as a show value. show
can be either a single string or a list of strings. If a list is provided, each value is returned with forward slashes as delimiters.
>>> example = corpus.interrogate({W: r'fr?iends?'}, show=[W, L, P])
>>> list(example.results)
['friend/friend/nn', 'friends/friend/nns', 'fiend/fiend/nn', 'fiends/fiend/nns', ... ]
Unigrams are generated by default. To get n-grams, pass in an n value as gramsize
:
>>> example = corpus.interrogate({W: r'wom[ae]n]'}, show=N, gramsize=2)
>>> list(example.results)
['a/woman', 'the/woman', 'the/women', 'women/are', ... ]
So, this leaves us with a huge array of possible things to show, all of which can be combined if need be:
Match | Governor | Dependent | Coref Head | 1L position | 1R position | |
---|---|---|---|---|---|---|
Word | W | G | D | H | A1 | Z1 |
Lemma | L | GL | DL | HL | A1L | Z1L |
Function | F | GF | DF | HF | A1F | Z1F |
POS tag | P | GP | DP | HP | A1P | Z1P |
Word class | X | GX | DX | HX | A1X | Z1X |
Distance from root | A | GA | DA | HA | A1A | Z1R |
Index | I | GI | DI | HI | A1I | Z1I |
Sentence index | S | GS | DS | HS | A1S | Z1S |
One further extra show value is 'c'
(count), which simply counts occurrences of a phenomenon. Rather than returning a DataFrame of results, it will result in a single Series. It cannot be combined with other values.
Working with trees¶
If you have elected to search trees, by default, searching will be done with Java, using Tregex. If you don’t have Java, or if you pass in tgrep=True
, searching will the more limited Tgrep2 syntax. Here, we’ll concentrate on Tregex.
Tregex is a language for searching syntax trees like this one:
To write a Tregex query, you specify words and/or tags you want to match, in combination with operators that link them together. First, let’s understand the Tregex syntax.
To match any adjective, you can simply write:
JJ
with JJ representing adjective as per the Penn Treebank tagset. If you want to get NPs containing adjectives, you might use:
NP < JJ
where < means with a child/immediately below. These operators can be reversed: If we wanted to show the adjectives within NPs only, we could use:
JJ > NP
It’s good to remember that the output will always be the left-most part of your query.
If you only want to match Subject NPs, you can use bracketting, and the $ operator, which means sister/directly to the left/right of:
JJ > (NP $ VP)
In this way, you build more complex queries, which can extent all the way from a sentence’s root to particular tokens. The query below, for example, finds adjectives modifying book:
JJ > (NP <<# /book/)
Notice that here, we have a different kind of operator. The << operator means that the node on the right does not need to be a child, but can be a descendant. the # means head—that is, in SFL, it matches the Thing in a Nominal Group.
If we wanted to also match magazine or newspaper, there are a few different approaches. One way would be to use | as an operator meaning or:
JJ > (NP ( <<# /book/ | <<# /magazine/ | <<# /newspaper/))
This can be cumbersome, however. Instead, we could use a regular expression:
JJ > (NP <<# /^(book|newspaper|magazine)s*$/)
Though it is beyond the scope of this guide to teach Regular Expressions, it is important to note that Regular Expressions are extremely powerful ways of searching text, and are invaluable for any linguist interested in digital datasets.
Detailed documentation for Tregex usage (with more complex queries and operators) can be found here.
Tree show values¶
Though you can use the same Tregex query for tree searches, the output changes depending on what you select as the show
value. For the following sentence:
These are prosperous times.
you could write a query:
r'JJ < __'
Which would return:
Show | Gloss | Output |
---|---|---|
W | Word | prosperous |
T | Tree | (JJ prosperous) |
p | POS tag | JJ |
C | Count | 1 (added to total) |
Working with dependencies¶
When working with dependencies, you can use any of the long list of search and show values. It’s possible to construct very elaborate queries:
>>> from corpkit.dictionaries import process_types, roles
### nominal nsubj with verbal process as governor
>>> crit = {F: r'^nsubj$',
... GL: processes.verbal.lemmata,
... GF: roles.event,
... P: r'^N'}
### interrogate corpus, outputting the nsubj lemma
>>> sayers = parsed.interrogate(crit, show=L)
Working with metadata¶
If you’ve used speaker segmentation and/or metadata addition when building your corpus, you can tell the interrogate()
method to use these values as subcorpora, or restrict searches to particular values. The code below will limit searches to sentences spoken by Jason and Martin, or exclude them from the search:
>>> corpus.interrogate(query, just_metadata={'speaker': ['JASON', 'MARTIN']})
>>> corpus.interrogate(query, skip_metadata={'speaker': ['JASON', 'MARTIN']})
If you wanted to compare Jason and Martin’s contributions in the corpus as a whole, you could treat them as subcorpora:
>>> corpus.interrogate(query, subcorpora='speaker',
... just_metadata={'speaker': ['JASON', 'MARTIN']})
The method above, however, will make an interrogation with two subcorpora, ‘JASON’ AND MARTIN
. You can pass a list in as the subcorpora
keyword argument to generate a multiindex:
>>> corpus.interrogate(query, subcorpora=['folder', 'speaker'],
just_metadata={'speaker': ['JASON', 'MARTIN']})
Working with coreferences¶
One major challenge in corpus linguistics is the fact that pronouns stand in for other words. Parsing provides coreference resolution, which maps pronouns to the things they denote. You can enable this kind of parsing by specifying the dcoref annotator:
>>> corpus = Corpus('example.txt')
>>> ops = 'tokenize,ssplit,pos,lemma,parse,ner,dcoref'
>>> parsed = corpus.interrogate(operations=ops)
### print a plaintext representation of the parsed corpus
>>> print(parsed.plain)
0. Clinton supported the independence of Kosovo
1. He authorized the use of force.
If you have done this, you can use coref=True while interrogating to allow coreferent forms to be counted alongside query matches. For example, if you wanted to find all the processes Clinton is engaged in, you could do:
>>> from corpkit.dictionaries import roles
>>> query = {W: 'clinton', GF: roles.process}
>>> res = parsed.interrogate(query, show=L, coref=True)
>>> res.results.columns
This matches both Clinton and he, and thus gives us:
['support', 'authorize']
Multiprocessing¶
Interrogating the corpus can be slow. To speed it up, you can pass an integer as the multiprocess
keyword argument, which tells the interrogate()
method how many processes to create.
>>> corpus.interrogate({T: r'__ > MD'}, multiprocess=4)
Note
Too many parallel processes may slow your computer down. If you pass in multiprocessing=True
, the number of processes will equal the number of cores on your machine. This is usually a fairly sensible number.
N-grams¶
N-gramming can be generated by making gramsize
> 1:
>>> corpus.interrogate({W: 'father'}, show='L', gramsize=3)
Collocation¶
Collocations can be shown by making using window
:
>>> corpus.interrogate({W: 'father'}, show='L', window=6)
Saving interrogations¶
>>> interro.save('savename')
Interrogation savenames will be prefaced with the name of the corpus interrogated.
You can also quicksave interrogations:
>>> corpus.interrogate(T, r'/NN.?/', save='savename')
Exporting interrogations¶
If you want to quickly export a result to CSV, LaTeX, etc., you can use Pandas’ DataFrame methods:
>>> print(nouns.results.to_csv())
>>> print(nouns.results.to_latex())
Other options¶
interrogate()
takes a number of other arguments, each of which is documented in the API documentation.
If you’re done interrogating, you can head to the page on Editing results to learn how to transform raw frequency counts into something more meaningful. Or, hit Next to learn about concordancing.