Unsupervised Training of a Finite-State Sliding-Window Part-of-Speech Tagger

Enrique Sánchez-Villamil, Mikel L. Forcada, and Rafael C. Carrasco
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant
Abstract. A simple, robust sliding-window part-of-speech tagger is
presented, and a method is given to estimate its parameters from an untagged
corpus. Its performance is compared to that of a standard Baum-Welch-trained
hidden-Markov-model part-of-speech tagger. Transformation into a finite-state
machine behaving exactly as the tagger itself is also demonstrated.
1 Introduction

A large fraction (typically 30%, but varying from one language to another) of
the words in natural language texts are words that, in isolation, may be assigned
more than one morphological analysis and, in particular, more than one
part of speech (PoS). The correct resolution of this kind of ambiguity for each
occurrence of the word in the text is crucial in many natural language processing
applications; for example, in machine translation, the correct equivalent of
a word may be very different depending on its part of speech.
This paper presents a sliding-window PoS tagger (SWPoST), that is, a system
which assigns the part of speech of a word based on the information provided by
a fixed window of words around it. The SWPoST idea is not new; however, we are
not aware of any SWPoST which, using reasonable approximations, may easily
be trained in an unsupervised manner; that is, avoiding costly manual tagging
of a corpus. Furthermore, as with any fixed-window SWPoST, and in contrast
with more customary approaches such as hidden Markov models (HMM), the
tagger may be implemented exactly as a finite-state machine (a Mealy machine).
The paper is organized as follows: section 2 gives some definitions, describes
the notation that will be used throughout the paper, and compares the
sliding-window approach to the customary (HMM) approach to part-of-speech
tagging; section 3 describes the approximations that allow a SWPoST to be
trained in an unsupervised manner and describes the training process itself;
section 4 describes how the tagger may be used on new text and how it may be
turned into a finite-state tagger; section 5 describes a series of experiments
performed to compare
 Work funded by the Spanish Government through grant TIC2003-08681-C02-01.
J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 454–463, 2004.
c Springer-Verlag Berlin Heidelberg 2004
the performance of a SWPoST to that of a HMM tagger and to explore the size
of the resulting finite state taggers (after minimization); and, finally, concluding
remarks are given in section 6.
2 Preliminaries
Let Γ = {γ_1, γ_2, …, γ_{|Γ|}} be the tagset for the task, that is, the set of
PoS tags a word may receive, and let W = {w_1, w_2, …} be the vocabulary of
the task. A partition
of W is established so that wi ≡ wj if and only if both are assigned the same
subset of tags. Each of the classes of this partition is usually called an ambiguity
class. It is usual [1] to refine this partition so that, for high-frequency words,
each word class contains just one word whereas, for lower-frequency words, word
classes are made to correspond exactly to ambiguity classes (although it is also
possible to use one-word classes for all words or to use only ambiguity classes),
which allows for improved performance on very frequent ambiguous words while
keeping the number of parameters of the tagger under control.
Any such refinement will be denoted as Σ = {σ_1, σ_2, …, σ_{|Σ|}}, where the
σ_i are word classes. In this paper, word classes will simply be ambiguity
classes, without any refinement. We will call T : Σ → 2^Γ the function
returning the set T(σ) of PoS tags for each word class σ.
The part-of-speech tagging problem may be formulated as follows: given a
text w[1]w[2]…w[L] ∈ W^+, each word w[t] is assigned (using a lexicon, a
morphological analyser, or a guesser) a word class σ[t] ∈ Σ to obtain an
ambiguously tagged text σ[1]σ[2]…σ[L] ∈ Σ^+; the task of the PoS tagger is to
obtain a tagged text γ[1]γ[2]…γ[L] ∈ Γ^+ (with γ[t] ∈ T(σ[t])) as correct as
possible.
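As an illustration of this formulation, the assignment of ambiguity classes can be sketched as follows; the toy lexicon and its tags are invented for the example (the paper's actual lexicon is extracted from the Penn Treebank):

```python
# A minimal sketch of assigning ambiguity classes: each word is mapped to
# the set of PoS tags it may receive; words sharing the same tag set fall
# into the same ambiguity class.  The lexicon below is a toy example.
TOY_LEXICON = {
    "the":  {"DT"},
    "can":  {"MD", "NN", "VB"},   # modal verb, noun, or base verb
    "run":  {"NN", "VB"},
    "fast": {"JJ", "RB"},
}
OPEN_CLASS = frozenset({"NN", "VB", "JJ", "RB"})  # for unknown words

def word_class(word):
    """Return the ambiguity class sigma[t] (a frozenset of tags) of a word."""
    return frozenset(TOY_LEXICON.get(word.lower(), OPEN_CLASS))

sentence = ["the", "can", "run"]
classes = [word_class(w) for w in sentence]
# "the" is unambiguous; "can" and "run" are ambiguous
```

Representing a word class directly as the set of its tags makes T(σ) trivial: T(σ) = σ.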
Statistical part-of-speech tagging looks for the most likely tagging of an
ambiguously tagged text σ[1]σ[2]…σ[L]:

  γ*[1]…γ*[L] = argmax_{γ[1]…γ[L]} P(γ[1]…γ[L] | σ[1]…σ[L])     (1)
which, using Bayes' formula, becomes equivalent to:

  γ*[1]…γ*[L] = argmax_{γ[1]…γ[L]} P_S(γ[1]…γ[L]) P_L(σ[1]…σ[L] | γ[1]…γ[L])     (2)
where P_S(γ[1]…γ[L]) is the probability of a particular tagging (syntactical
probability) and P_L(σ[1]…σ[L] | γ[1]…γ[L]) is the probability of that
particular tagging generating the text σ[1]…σ[L] (lexical probability). In
hidden Markov models (HMM) [5], these probabilities are approximated as
products; the syntactical probabilities are modeled by a first-order Markov
process:

  P_S(γ[1]γ[2]…γ[L]) = ∏_{t=0}^{L} p_S(γ[t+1] | γ[t])     (3)
where γ[0] and γ[L+1] are fixed delimiting tags (which we will denote as γ_#
and which will usually correspond to sentence boundaries); lexical
probabilities are made independent of context:
  P_L(σ[1]σ[2]…σ[L] | γ[1]γ[2]…γ[L]) = ∏_{t=1}^{L} p_L(σ[t] | γ[t]).     (4)
The number of trainable parameters for such a tagger is (|Γ| + |Σ|)|Γ|. Tagging
(searching for the optimal γ∗[1]γ∗[2] …γ∗[L]) is implemented using a very
efficient, left-to-right algorithm usually known as Viterbi’s algorithm [1, 5]. If
conveniently implemented, Viterbi’s algorithm can output a partial tagging each
time a nonambiguous word is seen, but maintains multiple hypotheses when reading
ambiguous words. HMMs may be trained either from tagged text (simply by
counting and taking probabilities to be equal to frequencies) or from untagged
text, using the well-known expectation-maximization (forward-backward or
Baum-Welch) algorithm [5, 1].
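For reference, the Viterbi search over the model of eqs. (2)-(4) can be sketched as follows; all probability tables here are invented for illustration, whereas a real tagger would use the trained HMM parameters:

```python
import math

def viterbi(classes, p_s, p_l, delim="#"):
    """Best tag path for a sequence of ambiguity classes under eqs. (2)-(4).
    p_s[(prev_tag, tag)]: transition probability; p_l[(cls, tag)]: lexical
    probability p_L(sigma | gamma).  Unseen events get a tiny floor value."""
    best = {delim: 0.0}              # log-prob of best path ending in each tag
    back = []                        # one back-pointer table per position
    for cls in classes:
        nxt, ptr = {}, {}
        for g in cls:                # only tags licensed by the ambiguity class
            scores = {p: lp + math.log(p_s.get((p, g), 1e-12))
                      for p, lp in best.items()}
            p = max(scores, key=scores.get)
            nxt[g] = scores[p] + math.log(p_l.get((cls, g), 1e-12))
            ptr[g] = p
        best, back = nxt, back + [ptr]
    # close the sentence with the delimiter tag and trace back
    scores = {p: lp + math.log(p_s.get((p, delim), 1e-12))
              for p, lp in best.items()}
    g = max(scores, key=scores.get)
    path = [g]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# invented probabilities: after a determiner, a noun is more likely
DT, NV = frozenset({"DT"}), frozenset({"NN", "VB"})
p_s = {("#", "DT"): 0.9, ("DT", "NN"): 0.8, ("DT", "VB"): 0.1,
       ("NN", "#"): 0.9, ("VB", "#"): 0.5}
p_l = {(DT, "DT"): 1.0, (NV, "NN"): 0.6, (NV, "VB"): 0.4}
# viterbi([DT, NV], p_s, p_l) -> ["DT", "NN"]
```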
In this paper we look at tagging from a completely different perspective. Instead
of using the inverted formulation in eq. (2) we approximate the probability
in eq. (1) directly as:
  P(γ[1]γ[2]…γ[L] | σ[1]σ[2]…σ[L]) = ∏_{t=1}^{L} p(γ[t] | C(−)[t] σ[t] C(+)[t]),     (5)

where

  C(−)[t] = σ[t − N(−)] σ[t − N(−)+1] ··· σ[t − 1]

is a left context of size N(−),

  C(+)[t] = σ[t + 1] σ[t + 2] ··· σ[t + N(+)]

is a right context of size N(+), and σ[−N(−)+1], σ[−N(−)+2], …, σ[0] and
σ[L+1], σ[L+2], …, σ[L+N(+)] are all set to a special delimiting word class
σ_# such that T(σ_#) = {γ_#}, e.g., one containing the sentence-boundary
marker tag γ_# ∈ Γ. This sliding-window method is local in nature; it does not
consider
any context beyond the window of N(−) + N(+) + 1 words; its implementation is
straightforward, even more so than that of Viterbi's algorithm. The main
problem is the estimation of the probabilities p(γ[t] | C(−)[t] σ[t] C(+)[t]).
From a tagged corpus, these probabilities may easily be estimated by counting;
in this paper, however, we propose a way of estimating them from an untagged
corpus. Another problem is the large number of parameters of the model (worst
case O(|Σ|^(N(−)+N(+)+1) |Γ|)); we will discuss a way to reduce the number of
parameters to just O(|Σ|^(N(−)+N(+)) |Γ|) and show that, for many
applications, N(−) = N(+) = 1 is an adequate choice.
3 Training from an Untagged Corpus
The main approximation in this model is the following: we will assume that the
probability of finding a certain tag γ[t] in the center of the window depends
only on the preceding context C(−)[t] and the succeeding context C(+)[t], but
not on the particular word class σ[t] at position t; that is, the probability
that a word receives a certain label depends only selectionally on the word
(the context determines the probabilities of each label, whereas the word just
selects labels among those in T(σ[t])). We will denote this probability as
p_{C(−) γ C(+)} for short (with the position index [t] dropped because of the
invariance). The most probable tag γ*[t] is then
  γ*[t] = argmax_{γ ∈ T(σ[t])} p_{C(−)[t] γ C(+)[t]},     (6)
that is, the most probable tag in that context among those corresponding to
the current word class. The probabilities p_{C(−) γ C(+)} are easily estimated
from a tagged corpus (e.g., by counting), but estimating them from an untagged
corpus involves an iterative process; instead of estimating the probability,
we will estimate the count ñ_{C(−) γ C(+)}, which can be interpreted as the
effective number of times that tag γ would appear in the text between contexts
C(−) and C(+). In terms of these counts,
  p(γ | C(−) σ C(+)) = k_{C(−) σ C(+)} ñ_{C(−) γ C(+)}   if γ ∈ T(σ),
                       0                                  otherwise,     (7)

where k_{C(−) σ C(+)} is a normalization factor:

  k_{C(−) σ C(+)} = ( Σ_{γ' ∈ T(σ)} ñ_{C(−) γ' C(+)} )^(−1).     (8)
Now, how can the counts ñ_{C(−) γ C(+)} be estimated? If the window
probabilities p(γ | C(−) σ C(+)) were known, they could easily be obtained
from the text itself as follows:

  ñ_{C(−) γ C(+)} = Σ_{σ : γ ∈ T(σ)} n_{C(−) σ C(+)} p(γ | C(−) σ C(+)),     (9)

where n_{C(−) σ C(+)} is the number of times that word class σ appears between
contexts C(−) and C(+); that is, one would add p(γ | C(−) σ C(+)) each time a
word class σ containing tag γ appears between C(−) and C(+). Equations (7) and
(9) may be
iteratively solved until the ñ_{C(−) γ C(+)} converge. For the computation to
be more efficient, one can avoid storing the probabilities p(γ | C(−) σ C(+))
by organizing the iterations around the ñ_{C(−) γ C(+)} as follows, combining
eqs. (7), (8), and (9) and using an iteration index denoted with a superscript
[k]:

  ñ[k]_{C(−) γ C(+)} = ñ[k−1]_{C(−) γ C(+)} Σ_{σ : γ ∈ T(σ)} n_{C(−) σ C(+)} / ( Σ_{γ' ∈ T(σ)} ñ[k−1]_{C(−) γ' C(+)} ),     (10)
where the iteration may easily be seen as a process of successive corrections
to the effective counts ñ_{C(−) γ C(+)}. A possible initial value is given by

  ñ[0]_{C(−) γ C(+)} = Σ_{σ : γ ∈ T(σ)} n_{C(−) σ C(+)} / |T(σ)|,     (11)
that is, assuming that, initially, all possible tags are equally probable for each
word class.
Equations (10) and (11) contain the counts n_{C(−) σ C(+)}, which depend on
N(−) + N(+) + 1 word classes; if memory is at a premium, instead of reading
the text once to count these and then iterating, the text may be read in each
iteration to avoid storing the n_{C(−) σ C(+)}, and the ñ[k]_{C(−) γ C(+)}
may be computed on the fly. Iterations proceed until a selected convergence
condition has been met (e.g., a comparison of the ñ[k]_{C(−) γ C(+)} with the
ñ[k−1]_{C(−) γ C(+)}, or the completion of a predetermined number of
iterations).
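The estimation loop of eqs. (9)-(11) can be sketched as follows for the case N(−) = N(+) = 1. Word classes are represented directly as sets of tags (so T(σ) = σ); the corpus of ambiguity classes is assumed to be supplied by the caller:

```python
from collections import Counter, defaultdict

SENT_DELIM = frozenset({"#"})  # delimiting class sigma_#, with T(sigma_#) = {"#"}

def window_counts(sentences):
    """n_{C(-) sigma C(+)}: how often word class sigma occurs between left
    context C(-) and right context C(+), for N(-) = N(+) = 1."""
    n = Counter()
    for sent in sentences:
        padded = [SENT_DELIM] + list(sent) + [SENT_DELIM]
        for t in range(1, len(padded) - 1):
            n[(padded[t - 1], padded[t], padded[t + 1])] += 1
    return n

def effective_counts(n, iterations=4):
    """Iterate eqs. (10)-(11) to estimate the effective counts
    n~_{C(-) gamma C(+)} from the raw window counts n."""
    # eq. (11): start by assuming all tags of a class are equally probable
    nt = defaultdict(float)
    for (cm, sigma, cp), c in n.items():
        for g in sigma:
            nt[(cm, g, cp)] += c / len(sigma)
    # eq. (10): successive corrections to the effective counts
    for _ in range(iterations):
        new = defaultdict(float)
        for (cm, sigma, cp), c in n.items():
            z = sum(nt[(cm, g, cp)] for g in sigma)  # normalization, eq. (8)
            for g in sigma:
                new[(cm, g, cp)] += c * nt[(cm, g, cp)] / z
        nt = new
    return nt
```

Note that each iteration redistributes every raw window count over the tags of its class, so the total effective count is conserved across iterations.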
4 Tagging Text: A Finite-State Tagger
Once the ñ_{C(−) γ C(+)} have been computed, the winning tag for class σ in
context C(−) ··· C(+), eq. (6), may easily be computed for all of the words in
a text. Unlike an HMM tagger [2], a sliding-window PoS tagger may be turned
exactly into a finite-state transducer [6]; in particular, into a Mealy
machine with transitions of the form

  σ[t−N(−)] σ[t−N(−)+1] ··· σ[t+N(+)−1]  --σ[t+N(+)] : γ*[t]-->  σ[t−N(−)+1] σ[t−N(−)+2] ··· σ[t+N(+)]
This Mealy machine reads a text σ[1]…σ[L] word by word and outputs the winning
tag sequence γ*[1]…γ*[L] with a delay of N(+) words. The resulting transducer
has, in the worst case, O(|Σ|^(N(−)+N(+))) states and O(|Σ|^(N(−)+N(+)+1))
transitions, but it may be minimized using traditional methods for finite-state
transducers into a compact version of the sliding-window PoS tagger, which
takes into account the fact that different contexts may actually be grouped
because they lead to the same disambiguation results.
5 Experiments
This section reports experiments to assess the performance of sliding-window
part-of-speech taggers using different amounts of context, compares it with
that of customary Baum-Welch-trained HMM taggers [1], and describes the
conversion of the resulting SWPoST taggers into finite-state machines.
The corpus we have used for training and testing is the Penn Treebank, version
3 [4, 3], which contains more than one million PoS-tagged words (1014377) of
English text taken from The Wall Street Journal. The word classes Σ of the
Treebank will be taken simply to be ambiguity classes, that is, subsets of the
collection of different part-of-speech tags (Γ). The Treebank has 45 different
part-of-speech tags; 244261 of its words are ambiguous (24.08%).
The experiments use a lexicon extracted from the Penn Treebank, that is,
a list of words with all the possible parts of speech observed. The exact tag
given in the Treebank for each occurrence of each word is taken into account
for testing but not for training. However, to simulate the effect of using a real,
limited morphological analyser, we have filtered the resulting lexicon as follows:
– only the 14276 most frequent words have been kept, which ensures a 95%
text coverage (i.e., 5% of the words are unknown);
– for each word, any part-of-speech tag occurring less than 5% of the time has
been removed.
Using this simplified lexicon, texts in the Penn Treebank show 219 ambiguity
classes. Words which are not included in our lexicon are assigned to a special
ambiguity class (the open class) containing all tags representing parts of
speech that can grow (i.e., a new word can be a noun or a verb, but hardly
ever a preposition).¹
In order to train the taggers we have applied the following strategy, so that
we can use as much text as possible for training: the Treebank is divided into
20 similarly-sized sections; a leave-one-out procedure is applied, using 19
sections for training and the remaining one for testing, so that our results
are the average of all 20 different train–test configurations. Figures show
the average correct-tag rate only over ambiguous words (non-ambiguous words
are not counted as successful disambiguations).
5.1 Effect of the Amount of Context
First of all, we show the results of a SWPoST using no context (N(−) = N(+) = 0)
as a baseline, and compare them to those of a Baum-Welch-trained HMM tagger
and to random tagging. As can be seen in figure 1, the performance of the
SWPoST without context is not much better than random tagging. This happens
because without context the SWPoST simply delivers an estimate of the most
likely tag in each class. We can also observe that the HMM tagger reaches an
accuracy of around 61% on ambiguous words.
In order to improve the results one obviously needs to increase the context
(i.e., widen the sliding window). As a first step we show the results of using
a reduced context of only one word, either before (N(−) = 1, N(+) = 0) or
after (N(−) = 0, N(+) = 1) the current word. Figure 2 shows that even with
such a limited context the performance is much more adequate. The number of
trainable parameters of the SWPoST in this case is O(|Σ||Γ|), slightly less
than the O(|Σ||Γ| + |Γ|^2) of the HMM tagger.
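The parameter counts involved can be checked directly against the figures quoted elsewhere in the paper, taking |Γ| = 45 tags and |Σ| = 220 word classes (219 ambiguity classes plus the open class):

```python
# Worst-case parameter counts, using |Gamma| = 45 tags and |Sigma| = 220
# word classes (219 ambiguity classes plus the open class).
n_tags, n_classes = 45, 220

hmm_params = (n_tags + n_classes) * n_tags   # (|Gamma| + |Sigma|)|Gamma|
swpost_one = n_classes * n_tags              # one context word: |Sigma||Gamma|
swpost_two = n_classes ** 2 * n_tags         # N(-) = N(+) = 1: |Sigma|^2 |Gamma|

assert hmm_params == 11925     # matches the figure in the concluding remarks
assert swpost_two == 2178000   # matches the worst-case total quoted there
```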
There is a significant difference between using as context the preceding
(N(−) = 1 and N(+) = 0) or the succeeding (N(−) = 0 and N(+) = 1) word. The
cognitive origin of this difference could be that, when people process
language, they tend to build hypotheses based on what they have already heard
or read, hypotheses which are then used to reduce the ambiguity of words as
they arrive.
1 Our open class contains the tags CD, JJ, JJR, JJS, NN, NNP, NNPS, RB, RBR, RBS, UH,
[Figure: accuracy (100 − word error rate) vs. training iterations (0–16)]
Fig. 1. Comparison between an HMM tagger, the SWPoST with no context (N(−) =
N(+) = 0) and a random tagger
[Figure: accuracy (100 − word error rate) vs. training iterations (0–16)]
Fig. 2. Comparison between an HMM tagger and the SWPoST using N(−) = 1 and
N(+) = 0 (preceding) and using N(−) = 0 and N(+) = 1 (succeeding)
The next step is to increase the size of the context a bit more, to two
context words. In this case we have three different possibilities: using the
two immediately preceding words (N(−) = 2 and N(+) = 0), using one preceding
and one succeeding word (N(−) = 1 and N(+) = 1), and using the two succeeding
words (N(−) = 0 and N(+) = 2). We can see the results in figure 3. The
performance of the SWPoST is now much better than that of the HMM tagger when
using a context of N(−) = 1 and N(+) = 1. However, when using the two
succeeding words the results are worse than with the HMM tagger, and the
performance of the SWPoST with N(−) = 2 and N(+) = 0 is about as good as that
of an HMM tagger.
[Figure: accuracy (100 − word error rate) vs. training iterations (0–16)]
Fig. 3. Comparison between an HMM tagger and the SWPoST using N(−) = 2 and
N(+) = 0 (2 preceding), N(−) = 1 and N(+) = 1 (1 prec. and 1 suc.), and
N(−) = 0 and N(+) = 2 (2 succeeding)
Finally, we tried increasing the context a bit more, to three context words in
all possible geometries, but the results were not as good as we expected
(actually worse), due to the fact that the corpus is not large enough to allow
the estimation of O(|Γ||Σ|^3) parameters.
The whole set of figures shows that SWPoST training usually converges after
three or four iterations, which makes training very efficient in terms of time.
5.2 Finite-State Sliding-Window PoS Tagger
Once we have analysed the performance of the SWPoST we study its transformation
into an equivalent finite-state transducer (FST). Given that the best
results reported in the previous section correspond to using the context N(−)=1
[Figure: fallback order N(−) = N(+) = 1 → N(−) = 1, N(+) = 0 →
N(−) = 0, N(+) = 1 → N(−) = N(+) = 0]
Fig. 4. Fallback strategy
and N(+) = 1, we build an FST that has a decision delay of one time unit, with
transitions of the form

  σ[t−1] σ[t]  --σ[t+1] : γ*[t]-->  σ[t] σ[t+1].
Many of these transitions correspond to contexts that have never or hardly
ever been observed in the corpus. To improve the accuracy of the tagger, a
fallback strategy was applied; this strategy uses SWPoSTs with smaller
contexts trained on the same corpus to define the output of these unseen
transitions. Figure 4 shows the order of preference of the fallback strategy:
if the N(−) = 1, N(+) = 1 tagger fails, the next best tagger, N(−) = 1,
N(+) = 0, is used; if this fails, N(−) = 0, N(+) = 1 is used, etc. The
resulting FST has a slightly improved performance,
reaching 67.15% accuracy for ambiguous words.
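The fallback lookup can be sketched as follows; the lookup tables here are hypothetical stand-ins for the effective counts of the four taggers:

```python
def winning_tag(sigma, cm, cp, tables):
    """Fallback lookup in the order of figure 4.  tables maps a window
    geometry (N-, N+) to a dict of effective counts keyed by
    ((left context or None, right context or None), tag)."""
    for nm, np_ in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        key = (cm if nm else None, cp if np_ else None)
        scores = {g: tables[(nm, np_)].get((key, g), 0.0) for g in sigma}
        if any(scores.values()):      # this geometry has seen the context
            return max(scores, key=scores.get)
    return next(iter(sigma))          # last resort: an arbitrary licensed tag

# hypothetical tables: the full (1, 1) context was never observed, but the
# left-context-only tagger has seen the class after a determiner
tables = {
    (1, 1): {},
    (1, 0): {((frozenset({"DT"}), None), "NN"): 3.0},
    (0, 1): {},
    (0, 0): {},
}
# winning_tag(frozenset({"NN", "VB"}), frozenset({"DT"}), frozenset({"#"}),
#             tables) falls back from (1, 1) to (1, 0) and picks "NN"
```

In the compiled FST, this backoff is resolved at construction time, so every unseen transition already carries the output of the best smaller-context tagger.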
5.3 Minimization of the Finite-State SWPoST
The FST created in this way has a large number of states; customary
finite-state minimization may be expected to reduce the number of states and
therefore the memory requirements. The algorithm to build the FST generates in
our case 48400 states (|Σ|^2) and 10648000 transitions (|Σ|^3). After
minimization, the FST is reduced to 22137 states and 4870140 transitions.
Given the large number of ambiguity classes, minimizing to about half the size
is not far from what we could expect.
6 Concluding Remarks
We have shown that, like commonly-used HMM taggers, simple and intuitive
sliding-window part-of-speech taggers (SWPoST) may be iteratively trained in
an unsupervised manner, using reasonable approximations to reduce the number
of trainable parameters. The number of trainable parameters depends on the
size of the sliding window. Experimental results with the Penn Treebank show
that the performance of SWPoST and HMM taggers having a similar number of
trainable parameters is comparable. The best results, better than those of HMM
taggers, are obtained using a SWPoST with a context of one preceding and one
succeeding word, for a worst-case total of 2178000 parameters (with the HMM
tagger having only 11925). The SWPoST can be exactly implemented as a
finite-state transducer which, after minimization, has 22137 states and
4870140 transitions.
Furthermore, the functioning of a SWPoST is simple and intuitive, which allows
for simple implementation and maintenance; for instance, if a training error
is found, it is easy to manually correct a transition in the resulting FST. We
are currently studying ways to further reduce the number of states and
transitions at a small price in tagging accuracy, by using probabilistic
criteria to prune uncommon contexts which do not contribute significantly to
the overall performance.

References
1. D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A practical part-of-speech
tagger. In Third Conference on Applied Natural Language Processing.
Association for Computational Linguistics. Proceedings of the Conference,
pages 133–140, Trento, Italy, 31 March–3 April 1992.
2. André Kempe. Finite state transducers approximating hidden Markov models. In
Philip R. Cohen and Wolfgang Wahlster, editors, Proceedings of the Thirty-Fifth
Annual Meeting of the Association for Computational Linguistics and Eighth Conference
of the European Chapter of the Association for Computational Linguistics,
pages 460–467, Somerset, New Jersey, 1997.
3. Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann
Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank:
Annotating predicate argument structure. In Proc. ARPA Human Language Technology
Workshop, pages 110–115, 1994.
4. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building
a large annotated corpus of English: the Penn Treebank. Computational
Linguistics, 19:313–330, 1993. Reprinted in Susan Armstrong, ed. 1994, Using
large corpora, Cambridge, MA: MIT Press, 273–290.
5. Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
6. E. Roche and Y. Schabes. Introduction. In E. Roche and Y. Schabes, editors,
Finite-State Language Processing, pages 1–65. MIT Press, Cambridge, Mass., 1997.
