Toolkit

The SupWSD toolkit is a supervised word sense disambiguation system. Its flexible framework allows users to combine different preprocessing modules, select feature extractors, and choose which classifier to use. SupWSD is lightweight and has a small memory footprint; it provides a simple XML file to configure the disambiguation process.

The SupWSD toolkit requires JRE 1.8 or above.

The zip file is available from the download page.

Installation

To work with the SupWSD toolkit, unpack the zip file with:

unzip SupWSD-TOOLKIT-1.0.0.zip

Create your Eclipse project (File -> New -> Maven or Gradle project, give the project a name and click Finish). This creates a new folder with the project name under the Eclipse workspace folder.

Copy the config and resources folders from the SupWSD-TOOLKIT-1.0.0 folder into your workspace/projectFolder.

Now, include the supwsd-toolkit-1.0.0.jar file in the project build classpath:

  1. Select the project from the Package Explorer view (Window -> Show View -> Package Explorer)
  2. From the menu bar click on Project and then Properties. Select Java Build Path from the contents column on the left and open the Libraries tab
  3. Click on the Add External JARs button, browse to the downloaded SupWSD-TOOLKIT-1.0.0 folder, and select the supwsd-toolkit-1.0.0.jar file

Next, add the required libraries to the project.

If you created a Maven project, add the following to the pom.xml file:

<project>
	<dependencies>
		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.8.0</version>
		</dependency>
		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.8.0</version>
			<classifier>models</classifier>
		</dependency>
		<dependency>
			<groupId>edu.mit</groupId>
			<artifactId>jwi</artifactId>
			<version>2.2.3</version>
		</dependency>
		<dependency>
			<groupId>net.sf.jwordnet</groupId>
			<artifactId>jwnl</artifactId>
			<version>1.4_rc3</version>
		</dependency>
		<dependency>
			<groupId>org.apache.commons</groupId>
			<artifactId>commons-collections4</artifactId>
			<version>4.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.opennlp</groupId>
			<artifactId>opennlp-tools</artifactId>
			<version>1.8.0</version>
		</dependency>
		<dependency>
			<groupId>org.ehcache</groupId>
			<artifactId>ehcache</artifactId>
			<version>3.3.1</version>
		</dependency>
		<dependency>
			<groupId>com.google.code.externalsortinginjava</groupId>
			<artifactId>externalsortinginjava</artifactId>
			<version>0.2.3</version>
		</dependency>
		<dependency>
			<groupId>org.annolab.tt4j</groupId>
			<artifactId>org.annolab.tt4j</artifactId>
			<version>1.2.1</version>
		</dependency>
		<dependency>
			<groupId>tw.edu.ntu.csie</groupId>
			<artifactId>libsvm</artifactId>
			<version>3.17</version>
		</dependency>
		<dependency>
			<groupId>de.bwaldvogel</groupId>
			<artifactId>liblinear</artifactId>
			<version>1.95</version>
		</dependency>
	</dependencies>
</project>

If you created a Gradle project, add the following to the build.gradle file:

repositories {
	mavenCentral()
}
dependencies {
	compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0'
	compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier: 'models'
	compile group: 'edu.mit', name: 'jwi', version: '2.2.3'
	compile group: 'net.sf.jwordnet', name: 'jwnl', version: '1.4_rc3'
	compile group: 'org.apache.commons', name: 'commons-collections4', version: '4.1'
	compile group: 'org.apache.opennlp', name: 'opennlp-tools', version: '1.8.0'
	compile group: 'org.ehcache', name: 'ehcache', version: '3.3.1'
	compile group: 'com.google.code.externalsortinginjava', name: 'externalsortinginjava', version: '0.2.3'
	compile group: 'org.annolab.tt4j', name: 'org.annolab.tt4j', version: '1.2.1'
	compile group: 'tw.edu.ntu.csie', name: 'libsvm', version: '3.17'
	compile group: 'de.bwaldvogel', name: 'liblinear', version: '1.95'
}

Configuration

You can customize the toolkit pipeline using the supconfig.xml file inside config/:

<supwsd xsi:noNamespaceSchemaLocation="supconfig.xsd">
	<working_directory></working_directory>
	<parser mns=""></parser>
	<preprocessor>
		<splitter model=""></splitter>
		<tokenizer model=""></tokenizer>
		<tagger model=""></tagger>
		<lemmatizer model=""></lemmatizer>
		<dependency_parser model=""></dependency_parser>
	</preprocessor>
	<extraction>
		<features>
			<pos_tags cutoff=""></pos_tags>
			<local_collocations cutoff=""></local_collocations>
			<surrounding_words cutoff="" window=""></surrounding_words>
			<word_embeddings strategy="" window="" vectors="" vocab="" cache=""></word_embeddings>
			<syntactic_relations></syntactic_relations>
		</features>
	</extraction>
	<classifier></classifier>
	<writer></writer>
	<sense_inventory dict=""></sense_inventory>
</supwsd>

Working directory

Specify the path of the directory in the file system where the trained models are to be saved.
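
For example (the path is a placeholder):

<working_directory>/home/user/supwsd/models</working_directory>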

Parser

SupWSD provides many different parsers, targeted at the various formats of the Senseval/SemEval WSD competitions (both all-words and lexical sample), along with a parser for plain text.

Attributes

Name  Type    Description
mns   string  Optional. Path of the file containing the lexelt information of the test instances.
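
For example, a lexical-sample parser with its lexelt file might be configured as follows (both the element value and the mns path are illustrative placeholders; the accepted parser names are defined by the distribution):

<parser mns="resources/lexelts.txt">lexical</parser>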

Preprocessor

This tag can be used to set the components of the preprocessing pipeline.

For each component you can specify the model to apply using the model attribute. The simple component performs string splitting using the value of the model attribute as the delimiter. If you want to bypass a phase, set the value of the component to none. An example configuration is shown after the list of components below.

Sentence splitter

Tokenizer

Part-Of-Speech tagger

Lemmatizer

Dependency parser
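
For instance, the following configuration uses the simple component for tokenization and bypasses the lemmatizer and the dependency parser (the model paths and the other component names are illustrative placeholders):

<preprocessor>
	<splitter model="resources/models/en-sent.bin">open_nlp</splitter>
	<tokenizer model=" ">simple</tokenizer>
	<tagger model="resources/models/en-pos.bin">open_nlp</tagger>
	<lemmatizer model="">none</lemmatizer>
	<dependency_parser model="">none</dependency_parser>
</preprocessor>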

Features

Select which features to use among the five standard features.

Set the value of a child to true/false to enable/disable the respective feature.
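
For example (the cutoff and window values are illustrative):

<features>
	<pos_tags cutoff="0">true</pos_tags>
	<local_collocations cutoff="1">true</local_collocations>
	<surrounding_words cutoff="1" window="-1">true</surrounding_words>
	<word_embeddings strategy="AVG" window="10" vectors="resources/vectors.txt" vocab="" cache="1">false</word_embeddings>
	<syntactic_relations>false</syntactic_relations>
</features>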

Part-Of-Speech tags

The part-of-speech tag of the target word and the part-of-speech tags surrounding the target word (with a left and a right window of length 3).

Attributes

Name    Type     Description
cutoff  integer  Required. Filter the feature according to a minimum frequency threshold; 0 disables the filter.

Surrounding words

The set of word tokens (excluding stopwords from a pre-specified list) appearing in the context of the target word.

Attributes

Name       Type     Description
cutoff     integer  Required. Filter the feature according to a minimum frequency threshold; 0 disables the filter.
window     integer  Required. Number of sentences in the neighborhood of the current sentence that will be used to extract words (-1 to extract all the words).
stopwords  string   Optional. Path of the file containing the list of stop words, one word per line.

Local collocations

Ordered sequences of tokens around the target word.

Attributes

Name       Type     Description
cutoff     integer  Required. Filter the feature according to a minimum frequency threshold; 0 disables the filter.
sequences  string   Optional. Path of the file containing the extraction sequences, one sequence per line.

Word embeddings

Pre-trained word embeddings, integrated according to three different strategies.

Attributes

Name      Type     Description
cache     float    Required. Vector cache size as a percentage of the number of vectors.
strategy  string   Required. AVG (Average): centroid of the embeddings; FRA (Fractional Decay): vectors weighted by their distance from the target word; EXP (Exponential Decay): vector weights decay exponentially with the distance from the target word.
vectors   string   Required. Path of the file containing the word embeddings, one vector per line.
vocab     string   Path of the file containing the vocabulary words, one word per line; this attribute is required only when the cache size is less than 1.
window    integer  Required. Number of words in the neighborhood of the target word that will be used to extract the embeddings.
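
The three strategies can be illustrated with the following sketch, which combines the embeddings of the words in the window into a single context vector (an illustration of the idea, not SupWSD's actual implementation; the exact decay functions are assumptions):

public static float[] combine(float[][] vectors, int target, String strategy) {

	int dim = vectors[0].length;
	float[] result = new float[dim];

	for (int i = 0; i < vectors.length; i++) {

		double weight;

		switch (strategy) {
			case "AVG": weight = 1.0; break;                              // centroid: all words weigh the same
			case "FRA": weight = 1.0 / (1 + Math.abs(i - target)); break; // fractional decay with the distance
			case "EXP": weight = Math.exp(-Math.abs(i - target)); break;  // exponential decay with the distance
			default: throw new IllegalArgumentException(strategy);
		}

		for (int j = 0; j < dim; j++)
			result[j] += weight * vectors[i][j];
	}

	return result;
}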

Syntactic relations

A set of features based on the dependency tree of the sentence.

Classifier

Select the machine learning library used to run a classification algorithm and generate a model for each sense-annotated word type in the input text.
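
For example (the accepted values are defined by the distribution; this name is an assumption based on the bundled libraries):

<classifier>liblinear</classifier>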

Writer

Choose the preferred way of printing the test results.

Values

Name    Description
all     Export the results to a single file.
single  Generate a file for each test instance.
plain   Create a plain text file, one sentence per line, with senses and probabilities for the disambiguated words.
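
For example:

<writer>plain</writer>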

Sense inventory

By specifying a sense inventory, you can exploit the Most Frequent Sense (MFS) back-off strategy for those target words for which no training data are available. If no sense inventory is specified, the model does not provide an answer and SupWSD will output "U" (unknown sense) as the answer.

Values

Name      Description
wordnet   Set the dict attribute to point to the directory where the WordNet dictionary is installed.
babelnet  If you want to use BabelNet as the sense inventory, download it from the BabelNet download section and follow the Java API guide to configure your project.
none      Switch off the MFS strategy.

Furthermore, since SupWSD uses JWNL to access WordNet, you must also define the path of the WordNet dictionary: in resources/wndictionary/prop.xml you can find the line <param name="dictionary_path" value="dict"/>, where the value dict specifies the path of the WordNet dictionary.
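
For example, with a local WordNet installation (the dict path is a placeholder):

<sense_inventory dict="resources/wndictionary">wordnet</sense_inventory>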

Usage examples

The SupWSD class is the entry point of the library and provides two static methods to train and test your datasets:

SupWSD.train("config file path", "dataset file path", "keys file path");
SupWSD.test("config file path", "dataset file path", "keys file path");

SupWSD will print precision, recall, and F-measure values at the end of the test run.
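
For instance, a minimal driver class might look like this (the package of the SupWSD class and the dataset file names are placeholder assumptions):

import it.si3p.supWSD.SupWSD;

public class Demo {

	public static void main(String[] args) throws Exception {

		// train a model for each ambiguous word type in the dataset,
		// using the pipeline defined in the XML configuration file
		SupWSD.train("config/supconfig.xml", "train.data.xml", "train.gold.key.txt");

		// disambiguate the test dataset with the trained models;
		// precision, recall and F-measure are printed at the end
		SupWSD.test("config/supconfig.xml", "test.data.xml", "test.gold.key.txt");
	}
}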

Customization

Let's now look at how to implement new modules for SupWSD and integrate them into the framework at various stages of the pipeline.

New input parser

In order to integrate a new XML parser, it is enough to extend the XMLHandler class and implement the methods startElement, endElement and characters. You can transmit the parsed text to the preprocessing module using the global variable mAnnotationListener.

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import it.si3p.supWSD.modules.parser.xml.XMLHandler;

public class NewXMLHandler extends XMLHandler {
	
	@Override
	public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
		
		NewLexicalTag tag = ...; // resolve the tag for this element
			
		switch (tag) {
			...
		}

		push(tag);
	}

	@Override
	public void endElement(String uri, String localName, String name) throws SAXException {

		NewLexicalTag tag = ...;
		
		switch (tag) {
			...
			case READY:
				// a complete sentence has been parsed: hand the
				// annotations over to the preprocessing module
				mAnnotationListener.notifyAnnotations();
		}
		
		pop();
	}

	@Override
	public void characters(char ch[], int start, int length) throws SAXException {

		String sentence = new String(ch,start,length);
		
		switch ((NewLexicalTag)get()) {
			...
		}
	}	
}

To integrate a general (non-XML) parser, instead, it is enough to extend the Parser class and implement the parse method.
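
As a sketch, a plain-text parser that emits one sentence per line might look like this (the Parser base class is only outlined in this guide, so the package, the parse signature and the way annotations are produced are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import it.si3p.supWSD.modules.parser.Parser;

public class LinesParser extends Parser {

	@Override
	public void parse(String file) throws IOException {

		try (BufferedReader reader = new BufferedReader(new FileReader(file))) {

			String line;

			while ((line = reader.readLine()) != null) {
				// build the annotations for this sentence here, then
				// hand them over to the preprocessing module
				mAnnotationListener.notifyAnnotations();
			}
		}
	}
}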

New preprocessing module

To add a new module to the pipeline, it is enough to implement the interfaces in the package modules.preprocessing.units.
It is also possible to add a brand new step to the pipeline (e.g. Named Entity Recognition) by extending the class Unit and implementing the methods to load the models asynchronously.
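
A hedged sketch of such a new step follows; every name below, except Unit itself, is an assumption made for illustration:

public class NERUnit extends Unit {

	@Override
	protected void load(String model) {
		// load the NER model here; Unit implementations load their
		// models asynchronously, so that the pipeline can start up
		// without blocking on I/O
	}
}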

Adding a new feature

A new feature can be implemented with a two-step procedure:

  1. Create a new class that extends the abstract class Feature. The constructor of this class requires a unique key and a name. It is also possible to set a default value for the feature by implementing the method getDefaultValue.

  2. Implement an extractor for the feature via the abstract class FeatureExtractor. In your constructor, invoke the superclass's constructor providing the cut-off value; then declare the feature class through the method getFeatureClass.

import java.util.Collection;
import it.si3p.supWSD.data.Annotation;
import it.si3p.supWSD.data.Lexel;
import it.si3p.supWSD.modules.extraction.features.Feature;

public abstract class FeatureExtractor {

	private final int mCutOff;

	public FeatureExtractor(int cutoff) {

		mCutOff = cutoff;
	}

	public final int getCutOff() {

		return mCutOff;
	}
	
	public abstract Class<? extends Feature> getFeatureClass();
	public abstract Collection<Feature> extract(Lexel lexel, Annotation annotation);
}
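
As an illustration, the following sketch (two classes, each in its own file) defines a feature that records the capitalization of the target word, together with its extractor. The Feature constructor arguments follow the description above; the method bodies and the getDefaultValue signature are assumptions:

import java.util.Collection;
import java.util.Collections;
import it.si3p.supWSD.data.Annotation;
import it.si3p.supWSD.data.Lexel;
import it.si3p.supWSD.modules.extraction.features.Feature;

public class CapitalizationFeature extends Feature {

	public CapitalizationFeature() {
		super("CAP", "capitalization"); // unique key and name, as required by the constructor
	}

	public String getDefaultValue() {
		return "0"; // optional default value for the feature (signature assumed)
	}
}

public class CapitalizationFeatureExtractor extends FeatureExtractor {

	public CapitalizationFeatureExtractor(int cutoff) {
		super(cutoff); // forward the cut-off value to the superclass
	}

	@Override
	public Class<? extends Feature> getFeatureClass() {
		return CapitalizationFeature.class;
	}

	@Override
	public Collection<Feature> extract(Lexel lexel, Annotation annotation) {
		// inspect the target word here and emit the features;
		// this body is left empty for illustration
		return Collections.emptyList();
	}
}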

Adding a new classifier

A new classifier can be implemented by extending the generic abstract class Classifier, which declares the methods to train and test the models.
Feature conversion is carried out with the generic method getFeatureNodes.

import java.util.Collection;
import java.util.SortedSet;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTest;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTrain;
import it.si3p.supWSD.modules.classification.scorer.Result;
import it.si3p.supWSD.modules.extraction.features.Feature;

public abstract class Classifier<T, V> {

	public abstract Object train(AmbiguityTrain ambiguity);
	protected abstract double[] predict(T model, V[] featuresNodes);
	protected abstract V[] getFeatureNodes(SortedSet<Feature> features);
	
	public final Collection<Result> evaluate(AmbiguityTest ambiguity, Object model, String cls){
		...
	}
}
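
For example, a classifier backed by the bundled liblinear library might be sketched as follows (the API of AmbiguityTrain is not documented here, so the training body is only outlined in comments, and the feature encoding is illustrative):

import java.util.SortedSet;
import de.bwaldvogel.liblinear.FeatureNode;
import de.bwaldvogel.liblinear.Linear;
import de.bwaldvogel.liblinear.Model;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTrain;
import it.si3p.supWSD.modules.extraction.features.Feature;

public class MyLinearClassifier extends Classifier<Model, FeatureNode> {

	@Override
	public Object train(AmbiguityTrain ambiguity) {
		// build a de.bwaldvogel.liblinear.Problem from the training
		// instances of 'ambiguity' (not shown here), then train with
		// a probability-capable solver, e.g.:
		// return Linear.train(problem, new Parameter(SolverType.L2R_LR, 1.0, 0.01));
		return null;
	}

	@Override
	protected double[] predict(Model model, FeatureNode[] featureNodes) {
		double[] probabilities = new double[model.getNrClass()];
		Linear.predictProbability(model, featureNodes, probabilities);
		return probabilities;
	}

	@Override
	protected FeatureNode[] getFeatureNodes(SortedSet<Feature> features) {
		// map each feature to a binary (index, value) node; consecutive
		// indices are an illustration, not SupWSD's actual encoding
		FeatureNode[] nodes = new FeatureNode[features.size()];
		int index = 0;
		for (Feature feature : features) {
			nodes[index] = new FeatureNode(index + 1, 1.0);
			index++;
		}
		return nodes;
	}
}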