pattern.search
The pattern.search module has a pattern matching system similar to regular expressions that can be used to search a string by syntax (word function) or by semantics (word meaning).
It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.
![](media/pattern_schema.gif)
The `search()` function takes a string (e.g., a word or a sequence of words) and returns a list of non-overlapping matches in the given sentence. The `match()` function returns the first match, or `None`. Both functions call `compile()`, which takes a string and returns a `Pattern` object.
search(pattern, sentence)
match(pattern, sentence)
compile(pattern)
>>> from pattern.search import search
>>> print search('rabbit', 'big white rabbit')
[Match(words=[Word('rabbit')])]
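The `match()` and `compile()` functions work the same way (a minimal sketch; the output shapes follow the example above):
>>> from pattern.search import match, compile
>>> print match('rabbit', 'big white rabbit')
Match(words=[Word('rabbit')])
>>> p = compile('rabbit|bunny')
>>> print p.search('big white bunny')
[Match(words=[Word('bunny')])]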
Search strings can contain a wildcard character (`*`) at the \*start, at the end\*, at \*both\* ends or in\*between:
>>> print search('rabbit*', 'big white rabbit')
[Match(words=[Word('rabbit')])]
>>> print search('rabbit*', 'big white rabbits')
[Match(words=[Word('rabbits')])]
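The other wildcard positions work the same way (a sketch; the expected matches follow from the semantics described above):
>>> print search('*abbit', 'big white rabbit')
[Match(words=[Word('rabbit')])]
>>> print search('r*bbit', 'big white rabbit')
[Match(words=[Word('rabbit')])]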
Search strings can contain multiple options separated by a vertical bar (`|`):
>>> print search('rabbit|cony|bunny', 'big black bunny')
[Match(words=[Word('bunny')])]
The examples above can also be resolved with (faster) regular expressions. The pattern.search module is more useful with parsed sentences. The pattern.en module has a parser that takes a string and assigns a part-of-speech tag to each word (e.g., `NN` = noun, `VB` = verb, `JJ` = adjective). The parser also groups words into chunks (e.g., `JJ` + `NN` = `NP` = noun phrase) and finds word lemmata (was → be).
A parsed `Sentence` or `Text` can be searched by part-of-speech tags:
>>> from pattern.search import search
>>> from pattern.en import parsetree
>>>
>>> t = parsetree('big white rabbit')
>>> print t
>>> print
>>> print search('JJ', t) # all adjectives
>>> print search('NN', t) # all nouns
>>> print search('NP', t) # all noun phrases
[Sentence('big/JJ/B-NP/O white/JJ/I-NP/O rabbit/NN/I-NP/O')]
[Match(words=[Word(u'big/JJ')]), Match(words=[Word(u'white/JJ')])]
[Match(words=[Word(u'rabbit/NN')])]
[Match(words=[Word(u'big/JJ'), Word(u'white/JJ'), Word(u'rabbit/NN')])]
A `Taxonomy` can be used to define semantic categories of words. Say we want to extract flower names from a text. The search pattern is rather clumsy: `"rose|lily|daisy|daffodil|begonia"`. A more robust approach is to work with a taxonomy:
>>> from pattern.search import search, taxonomy
>>> from pattern.en import parsetree
>>>
>>> for f in ('rose', 'lily', 'daisy', 'daffodil', 'begonia'):
>>> taxonomy.append(f, type='flower')
>>>
>>> t = parsetree('A field of white daffodils.', lemmata=True)
>>> print t
>>> print
>>> print search('FLOWER', t)
[Sentence('A/DT/B-NP/O/a field/NN/I-NP/O/field of/IN/B-PP/B-PNP/of'
'white/JJ/B-NP/I-PNP/white daffodils/NNS/I-NP/I-PNP/daffodil ././O/O/.')]
[Match(words=[Word(u'white/JJ'), Word(u'daffodils/NNS')])]
Note how the search pattern has `"FLOWER"` in uppercase. Since `search()` is case-insensitive, uppercase words are recognized as taxonomy terms (i.e., `FLOWER` = rose + lily + daisy + daffodil + begonia). Furthermore, since lemmata were parsed, daffodils is recognized as the plural form of daffodil (the lemma), and as such also part of `FLOWER`.
Note that the returned match is white daffodils. Since `search()` is (by default) greedy, the whole `NP` chunk is matched. In other words, white daffodils is regarded as a more specific instance of daffodil.
A `Pattern` is a sequence of constraints that matches certain phrases in a (parsed) sentence. Each constraint can match a word in the sentence. If a number of successive words corresponds to the entire sequence of constraints, the phrase is a match. The search is case-insensitive.

Constraints can be constructed for syntax (e.g., find all adjectives) and semantics (e.g., find all product names). For example, `Pattern.fromstring("NP be * than NP")` matches phrases such as the cat was faster than the mouse, and Chuck Norris is cooler than Dolph Lundgren, since `NP` matches any noun phrase. With `TAXONOMY`, the global taxonomy is used to categorize words.
pattern = Pattern(sequence=[])
pattern = Pattern.fromstring(string, taxonomy=TAXONOMY)
pattern.sequence # List of Constraint objects.
pattern.groups # List of groups, each a list of Constraint objects.
pattern.strict # Disable greedy matching?
pattern.scan(string)
pattern.search(sentence)
pattern.match(sentence, start=0)
- `Pattern.scan()` returns `True` if `Sentence(string)` may yield matches. It can be faster to scan a tagged string before casting it to a `Sentence` or `Text` and searching it (a sketch of this flow follows below).
- `Pattern.search()` returns a list of `Match` objects from the given sentence.
- `Pattern.match()` returns the first `Match` found in the given sentence, or `None`.
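A minimal sketch of the scan-then-search flow, assuming pattern.en's `parse()` and `Sentence` (see the parse tree documentation):
>>> from pattern.search import Pattern
>>> from pattern.en import parse, Sentence
>>>
>>> p = Pattern.fromstring('JJ NN')
>>> s = parse('big white rabbit')  # Tagged string.
>>> if p.scan(s):                  # Cheap pre-check before constructing a Sentence.
>>>     print p.search(Sentence(s))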
For example:
>>> from pattern.search import Pattern
>>> from pattern.en import parsetree
>>>
>>> t = parsetree('Chuck Norris is cooler than Dolph Lundgren.', lemmata=True)
>>> p = Pattern.fromstring('{NP} be * than {NP}')
>>> m = p.match(t)
>>> print m.group(1)
>>> print m.group(2)
[Word(u'Chuck/NNP'), Word(u'Norris/NNP')]
[Word(u'Dolph/NNP'), Word(u'Lundgren/NNP')]
For `{}`, see the constraint string syntax below.
A `Constraint` matches a set of (tagged) words and taxonomy terms. For example:

- `Constraint.fromstring('with|of')` matches either with or of.
- `Constraint.fromstring('JJ?')` matches any adjective tagged `JJ`, but it is optional.
- `Constraint.fromstring('NP|SBJ')` matches subject noun phrases.
- `Constraint.fromstring('QUANTITY')` matches siblings of `QUANTITY` in the taxonomy.
constraint = Constraint(
words = [],
tags = [],
chunks = [],
roles = [],
taxa = [],
optional = False,
multiple = False,
first = False,
taxonomy = TAXONOMY,
exclude = None,
custom = None )
constraint = Constraint.fromstring(string, **kwargs)
constraint.index
constraint.string
constraint.words # List of allowed words/lemmata (of, with, ...)
constraint.tags # List of allowed parts-of-speech (NN, JJ, ...)
constraint.chunks # List of allowed chunk types (NP, VP, ...)
constraint.roles # List of allowed chunk roles (SBJ, OBJ, ...)
constraint.taxa # List of allowed taxonomy terms.
constraint.taxonomy # Taxonomy used for lookup.
constraint.optional # True => matches zero or one word.
constraint.multiple # True => matches one or more words.
constraint.first # True => can only match first word.
constraint.exclude # None, or Constraint of disallowed options.
constraint.custom # function(word) returns True if match.
constraint.match(word)
`Constraint.fromstring()` returns a new `Constraint` from the given string. It takes the same optional parameters as the constructor. Uppercase words in the given string indicate a part-of-speech tag (e.g., `NN`, `JJ`, `VP`) or a taxonomy term (e.g., `PRODUCT`, `PERSON`).
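A quick way to see how a given string is interpreted (a sketch; the attribute contents shown in comments are indicative):
>>> from pattern.search import Constraint
>>> print Constraint.fromstring('NN*').tags        # ['NN*'] (uppercase => tags)
>>> print Constraint.fromstring('rose|lily').words # ['rose', 'lily'] (lowercase => words)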
Some characters, such as `|` and `?`, are special. They affect how the constraint is interpreted:
| *Character* | *Example* | *Description* |
| --- | --- | --- |
| `(` | `(JJ)` | Wrapper for an optional constraint (deprecated). |
| `[` | `[Mac OS X \| Windows Vista]` | Wrapper for a constraint that has spaces. |
| `{` | `DT {JJ?} NN` | Wrapper for match groups, e.g., `Match.group(1)`. |
| `_` | `Windows_Vista` | Converted to a space. |
| `\|` | `ADJP\|ADVP` | Separator for different options. |
| `*` | `JJ*` | Used as a wildcard character. |
| `!` | `!be\|VB*` | Used in front of words/tags that are __not__ allowed. |
| `?` | `JJ?` | Used as a suffix, constraint is optional. |
| `+` | `RB\|JJ+` or `JJ?+` or `*+` | Used as a suffix, constraint can span multiple words. |
| `^` | `^hello` | Used as a prefix, constraint can only match first word. |
The characters listed in the table must be escaped if used as content (e.g., `"\?"`). You can use the module's `escape()` function. For example, `escape("hello?")` returns `"hello\?"`.
`Constraint.match()` returns `True` if the given string or `Word` is part of the constraint:

- the word (or its lemma) occurs in `Constraint.words`, OR
- the word (or its lemma) occurs in the `Constraint.taxa` taxonomy tree, AND
- the word's part-of-speech and/or chunk tags match those defined in the constraint.
Matching is case-insensitive. Individual terms in `Constraint.words` can contain wildcards (`*`, `+`, `?`). Some part-of-speech tags can also contain wildcards: `NN*`, `VB*`, `JJ*`, `RB*`, `PR*`, `WP*`.
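For example, a word-list constraint (a minimal sketch):
>>> from pattern.search import Constraint
>>> c = Constraint.fromstring('with|of')
>>> print c.match('with')
True
>>> print c.match('under')
False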
The following example demonstrates the use of optional and multiple constraints:
>>> from pattern.search import search
>>> from pattern.en import parsetree
>>>
>>> t = parsetree('tasty cat food')
>>> print t
>>> print
>>> print search('DT? RB? JJ? NN+', t)
[Sentence('tasty/JJ/B-NP/O cat/NN/I-NP/O food/NN/I-NP/O')]
[Match(words=[Word(u'tasty/JJ'), Word(u'cat/NN'), Word(u'food/NN')])]
The pattern matches successive nouns (`NN`), optionally preceded by a determiner (`DT`), adverb (`RB`) and/or adjective (`JJ`). It matches anything from food to cat food, tasty cat food, the tasty cat food, etc.
The pattern.en parser groups words that belong together into chunks. For example, the black cat is one chunk, tagged `NP` (i.e., a noun phrase). The head of the chunk is cat. By default, when a constraint matches the chunk head, it will greedily match the entire chunk. This means that if we search for cat and the sentence has a big black cat, the entire chunk will be returned.
This behavior can be disabled by passing a `STRICT` flag to `Pattern`, `compile()`, `search()` or `match()`:
>>> from pattern.search import search, STRICT
>>> from pattern.en import parsetree
>>>
>>> t = parsetree('The black cat is lurking in the tree.')
>>> print search('cat', t)
[Match(words=[Word(u'The/DT'), Word(u'black/JJ'), Word(u'cat/NN')])]
>>> print search('cat', t, STRICT)
[Match(words=[Word(u'cat/NN')])]
`Pattern.search()` returns a list of `Match` objects, where each match is a list of successive `Word` objects.
match = Match(pattern, words=[])
match.pattern # Pattern source.
match.words # List of Word objects.
match.string # String of words separated with a space.
match.start # Index of first word in sentence.
match.stop # Index of last word in sentence + 1.
match.group(index, chunked=False)
match.constraint(word)
match.constraints(chunk)
match.constituents(constraint=None)
- `Match.group()` returns a list of `Word` objects matching the constraints in a `{ }` group.
- `Match.constraint()` returns the `Constraint` that matched the given `Word`, or `None`.
- `Match.constraints()` returns the list of constraints that matched the given `Chunk`.
- `Match.constituents()` returns a list of `Word` and `Chunk` objects, with successive words grouped into chunks whenever possible. Optionally, returns only chunks/words that matched the given `Constraint` (or list of constraints). Chunks are only available if a `Sentence` or `Text` was given (i.e., not for plain strings).

For example:
>>> from pattern.search import match
>>> from pattern.en import parsetree
>>>
>>> t = parsetree('The turtle was faster than the hare.', lemmata=True)
>>> m = match('NP be ADJP|ADVP than NP', t)
>>> for w in m.words:
>>>     print w, '\t =>', m.constraint(w)
Word(u'The/DT') => Constraint(chunks=['NP'])
Word(u'turtle/NN') => Constraint(chunks=['NP'])
Word(u'was/VBD') => Constraint(words=['be'])
Word(u'faster/RBR') => Constraint(chunks=['ADJP', 'ADVP'])
Word(u'than/IN') => Constraint(words=['than'])
Word(u'the/DT') => Constraint(chunks=['NP'])
Word(u'hare/NN') => Constraint(chunks=['NP'])
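Continuing this example, `Match.constituents()` regroups the matched words into chunks where possible (a sketch; the exact representation of `Chunk` objects may differ):
>>> for c in m.constituents():
>>>     print type(c).__name__, c.string
Words that fall inside a complete chunk are returned together as a single `Chunk`; any remaining words are returned individually.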
Match groups in the search pattern can be used to quickly retrieve what you need from a `Match`:
>>> t = parsetree('the big black dog')
>>> m = match('DT {JJ?+ NN}', t)
>>> print m.group(0) # full pattern
>>> print m.group(1) # {JJ?+ NN}
>>> print m.group(1).string
[Word(u'the/DT'), Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
[Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
'big black dog'
Each `Word` in a `Match` or `Match.group()` has the following attributes:
word = Word(sentence, string, tag=None, index=0)
word.string
word.tag # Part-of-speech tag (e.g. NN, JJ).
word.sentence # Sentence (a list of successive Words).
word.index # Sentence index.
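Continuing the match above, for example (a sketch; the index counts from the first word of the sentence):
>>> w = m.group(1)[0]
>>> print w.string, w.tag, w.index
big JJ 1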
When `search()` or `match()` is given a string, `Word` objects are created when the `Match` is returned. When given a parsed `Sentence`, `Word` objects are linked from the sentence. These have extra attributes. For an overview of `Sentence`, `Chunk` and `Word`, see the parse tree documentation.
A taxonomy is a hierarchical tree of words classified by semantic type. For example: a begonia is a flower, and a flower is a plant. Taxonomy terms can be used as constraints. For example, `"FLOWER"` will match flower as well as begonia, or any other flower that has been defined in the taxonomy. By default, constraints will retrieve terms from the global `taxonomy`.
taxonomy = Taxonomy()
taxonomy.case_sensitive # False by default.
taxonomy.classifiers # List of Classifier objects.
taxonomy.append(term, type=None)
taxonomy.remove(term)
taxonomy.classify(term)
taxonomy.parents(term, recursive=False)
taxonomy.children(term, recursive=False)
- `Taxonomy.classify()` returns the (most recent) semantic type for a given term. If the term is not in the taxonomy, it tries `Taxonomy.classifiers` (see further).
- `Taxonomy.parents()` returns a list of all semantic types for the given term.
- `Taxonomy.children()` returns a list of all terms for the given semantic type. With `recursive=True`, the entire branch is traversed.
For example:
>>> from pattern.search import taxonomy, search
>>>
>>> taxonomy.append('chicken', type='food')
>>> taxonomy.append('chicken', type='bird')
>>> taxonomy.append('penguin', type='bird')
>>> taxonomy.append('bird', type='animal')
>>>
>>> print taxonomy.parents('chicken')
>>> print taxonomy.children('animal', recursive=True)
>>> print
>>> print search('FOOD', "I'm eating chicken.")
['bird', 'food']
['bird', 'penguin', 'chicken']
[Match(words=[Word('chicken')])]
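Continuing the example, `Taxonomy.classify()` returns the most recent semantic type for a term (here chicken was last appended as a bird):
>>> print taxonomy.classify('chicken')
bird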
A `Classifier` offers a rule-based approach to enrich the taxonomy. If a term is not in the taxonomy, the taxonomy iterates over its list of classifiers. Each classifier is a set of functions that can be customized to yield the semantic type of a term.
classifier = Classifier(
parents = lambda term: [],
children = lambda term: [])
classifier.parents(term) # Returns a list of parents for a term.
classifier.children(term) # Returns a list of children for a term.
This is useful because taxonomy terms can't include wildcards (i.e., the `*` character is taken literally).
>>> from pattern.search import taxonomy, search
>>> from pattern.search import Classifier
>>>
>>> def parents(term):
>>> return ['quality'] if term.endswith('ness') else []
>>>
>>> taxonomy.classifiers.append(Classifier(parents))
>>> taxonomy.append('cat', type='animal')
>>>
>>> print search('QUALITY of a|an|the ANIMAL', 'the litheness of a cat')
[Match(words=[Word('litheness'), Word('of'), Word('a'), Word('cat')])]
This example creates a classifier that tags words ending in -ness as `quality` (e.g., sharpness, greediness). This is more concise than manually adding all words ending in -ness to the taxonomy. The `quality` term is then used as a constraint. Remember to always define `Classifier.parents()`. For performance, `Classifier.children()` is not called in `Constraint.match()`.
The following example creates a rule-based taxonomy from the `pattern.en.wordnet` module:
>>> from pattern.search import taxonomy, WordNetClassifier
>>>
>>> taxonomy.classifiers.append(WordNetClassifier())
>>>
>>> print taxonomy.parents('cat', pos='NN')
>>> print taxonomy.parents('cat', pos='VB')
['feline']
['flog']
The pattern.search module has a number of useful list functions:
unique(iterable) # Returns a new list with unique items.
find(function, iterable) # Returns first item for which function(item) is True.
product(iterable, repeat=1) # Returns a generator of all combinations of length n.
variations(iterable, optional=lambda item: False)
odict(items=[])
- `product()` returns a generator of all permutations, with replacement. For example, `product([1, 2, 3], repeat=2)` yields: `[1, 1]`, `[1, 2]`, `[1, 3]`, `[2, 1]`, `[2, 2]`, `[2, 3]`, `[3, 1]`, `[3, 2]`, `[3, 3]`.
- `variations()` returns all variations of a sequence with optional items (in-order).
- `odict()` is a dictionary with ordered keys (e.g., like a stack). The most recent keys are returned first when traversing the dictionary. `odict.push()` takes a `(key, value)`-tuple and sets the given key to the given value. If the key exists, it pushes the updated item to the top of the stack.
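A short sketch of some of these helpers (the outputs in comments are indicative; return types, e.g., list vs. generator, may vary):
>>> from pattern.search import unique, find, odict
>>>
>>> print unique([1, 2, 2, 3])              # [1, 2, 3]
>>> print find(lambda x: x > 1, [1, 2, 3])  # 2
>>>
>>> d = odict()
>>> d.push(('a', 1))
>>> d.push(('b', 2))
>>> print d.keys()[0]  # Most recent key first: 'b'.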