Services

WordTies: A Nordic/Baltic Multilingual Wordnet Initiative WordTies is a multilingual wordnet initiative launched within the META-NORD/META-NET projects and originally concerned with the validation and pilot linking of Nordic and Baltic wordnets. The builders of these wordnets have applied very different compilation strategies: the Danish and Swedish wordnets are being developed from monolingual dictionaries and corpora and subsequently linked to Princeton WordNet. In contrast, the Finnish and Norwegian wordnets apply the expand method, translating from Princeton WordNet and the Danish wordnet, DanNet, respectively. The Estonian wordnet was built as part of the EuroWordNet project and by translating the base concepts from English as a first basis for monolingual extension. Recently, the Polish wordnet, plWordNet, has been added to WordTies. It is built using a monolingual approach similar to the one used for the Danish and Swedish wordnets.
FreeLing Morphosyntactic analyzer Web Service v.2.1 This Web Service deploys a FreeLing-based morphological analyzer. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese. WARNING: This WS has a new version.
P clue/ lexical class from Weka computer Web Service Given a training set encoded as vectors of cue (or feature) occurrences in weka format, this web service computes P(cue_i|class): the probability of seeing each cue as a member or non-member of the class, using an MLE approach (it counts the frequencies of appearance of each cue in each class).
Inputs:
- weka_signatures: classified instances encoded as cue vectors. Each slot of the vector contains the number of times each feature has been observed for that instance. The vector also has special slots: the total number of occurrences in the first slot, and the correct class (1 or 0) and the lemma (or any identifier of the instance) in the last two slots. The vectors should be encoded in a weka file, in UTF-8. The cue counts must be encoded as integers; that is, give the number of times each cue has been seen, not a relative frequency. For example, for the class of eventive nouns in English we would have some examples of eventive nouns and some of non-eventive nouns:
    @relation eventive.arff
    @attribute [total_occurences] numeric
    @attribute [cue_1] numeric
    @attribute [cue_2] numeric
    …
    @attribute [cue_n] numeric
    @attribute [eventive] {0,1}
    @attribute [lemma] string
    @data
    5,0,0,…,3,0,visa
    386,0,1,…,162,0,characteristic
    23,1,0,…,0,1,ceremony
    270,0,2,…,0,1,assembly
Outputs: The output is a delimiter-separated file (semicolon-separated in the example below) with the frequencies with which each cue has been observed with members and non-members of the class. Information about the number of tokens in each class is also given.
Example:
    #cue;data size class;data p class;data size no class;data p no class;
    cue_1;1301732;0.000883438372876;1516522;0.00137419701132;
    cue_2;1301732;0.000520076329075;1516522;0.000600716639785;
    ...
    cue_n;1301732;0.0243222107162;1516522;0.0177992801951;
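The counting behind such an estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's code: it assumes that P(cue|class) is the summed cue count divided by the summed total occurrences of the instances in that class (inferred from the "data size" columns of the example output), and the function name and toy rows are hypothetical.

```python
def estimate_cue_probabilities(rows, n_cues):
    """Estimate P(cue|class) by relative frequency (MLE).

    rows: list of (total_occurrences, [cue counts], label) with label 1 or 0.
    Assumption: the denominator is the summed total occurrences per class.
    """
    totals = {0: 0, 1: 0}
    cue_sums = {0: [0] * n_cues, 1: [0] * n_cues}
    for total, cues, label in rows:
        totals[label] += total
        for i, count in enumerate(cues):
            cue_sums[label][i] += count

    table = []
    for i in range(n_cues):
        p_class = cue_sums[1][i] / totals[1] if totals[1] else 0.0
        p_no_class = cue_sums[0][i] / totals[0] if totals[0] else 0.0
        # Same column order as the example output:
        # cue; data size class; p class; data size no class; p no class
        table.append((f"cue_{i + 1}", totals[1], p_class, totals[0], p_no_class))
    return table


# Toy rows using only the first two cue slots of the example instances.
toy_rows = [
    (5, [0, 0], 0),    # visa
    (386, [0, 1], 0),  # characteristic
    (23, [1, 0], 1),   # ceremony
    (270, [0, 2], 1),  # assembly
]
for row in estimate_cue_probabilities(toy_rows, n_cues=2):
    print(";".join(str(value) for value in row) + ";")
```

A cue never seen in a class gets probability 0 under plain MLE, which is one reason a smoothed or Bayesian estimate may be preferred before classification.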
Abstract nouns classifier Web Service This WS identifies abstract nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
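The "filtered" mode described for the noun classifiers amounts to a three-way decision on the score. A minimal sketch, assuming hypothetical threshold values (the pre-set thresholds of the services are not given in this listing) and a hypothetical function name:

```python
def filter_prediction(score, positive_threshold=0.5, negative_threshold=-0.5):
    """Map a classifier score to 'member', 'non-member' or 'unknown'.

    The default thresholds are illustrative assumptions, not the
    values used by the web services.
    """
    if score > 0 and score > positive_threshold:
        return "member"
    if score < 0 and score < negative_threshold:
        return "non-member"
    # Scores between the two thresholds are not trusted either way.
    return "unknown"


for noun, score in [("idea", 0.9), ("table", -0.8), ("light", 0.1)]:
    print(noun, filter_prediction(score))
```

Widening the gap between the two thresholds trades coverage for precision: more nouns end up as unknown, but the remaining decisions are more reliable.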
Anonymizer Web Service This WS substitutes proper nouns with tags. This process anonymizes an input text by removing the names of persons, places, corporations, etc. The service automatically calls the FreeLing WS and uses its Named Entity Recognition tool to detect proper nouns. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese.
Artifact nouns classifier Web Service This WS identifies artifact nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
Basic XCES to TXT converter Web Service A WS that converts a Basic XCES text corpus into plain text (.TXT).
Bayesian parameter estimation Web Service Given a training set encoded as vectors of cue (or feature) occurrences, this web service estimates the parameters P(cue_i|class): the probability of seeing each cue as a member or non-member of the class. This estimation is performed using Bayesian inference, which combines prior knowledge with observed data. The parameters estimated with this web service can be used, for example, to classify new instances using a Naive Bayes classifier. The output format is the one needed as input for the naive_bayes_classifier webservice.
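How a prior combines with observed counts can be illustrated with the posterior mean of a Beta-Binomial model. This is a generic sketch, not the estimator used by the service: the prior family, the default prior parameters and the function name are all assumptions made for illustration.

```python
def posterior_mean(cue_count, total_count, alpha=1.0, beta=1.0):
    """Posterior mean of P(cue|class) under a Beta(alpha, beta) prior.

    cue_count:   times the cue was observed in the class
    total_count: total observations in the class
    alpha, beta: hypothetical prior pseudo-counts (Beta(1, 1) is uniform)
    """
    if cue_count < 0 or total_count < cue_count:
        raise ValueError("counts must satisfy 0 <= cue_count <= total_count")
    return (cue_count + alpha) / (total_count + alpha + beta)


print(posterior_mean(0, 0))     # no data: the estimate is the prior mean
print(posterior_mean(3, 8))     # some data: prior and counts are blended
print(posterior_mean(300, 800)) # much data: close to the relative frequency
```

Unlike the plain relative frequency, this estimate is never exactly 0 or 1, which keeps the logarithms finite when the parameters are later used in a Naive Bayes classifier.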
Bohnet parser Web Service This WS performs dependency parsing using Bohnet's graph-based Parser. The input is text in plain text or CoNLL format. The languages supported are English and Spanish.
CQP Analyzer Web Service This WS allows analyzing an already indexed corpus (see CQP indexer WS for indexing details). The WS returns an Excel file with some statistical metrics such as number of nouns, verbs, ngrams, etc. The languages supported are Spanish and English.
CQP indexer Web Service CQP indexer WS based on the IMS Open Corpus Workbench (CWB). The input is an annotated corpus in tabular format. The output is the Corpus ID to be used by the CQPquery Web Service. Language independent WS.
CQP query Web Service This WS allows querying an already indexed corpus (see CQP indexer WS for indexing details). The WS is based on the IMS Open Corpus Workbench (CWB). Language independent WS.
Columns selector Web Service This WS extracts a column from a tabular input file. It is useful for working with CoNLL or FreeLing annotated corpora. Language independent WS.
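Column extraction from a tabular annotation file can be sketched as follows. The function name, the tab separator and the choice to preserve empty lines (commonly used as sentence boundaries in CoNLL-style files) are assumptions of this example, not documented behaviour of the service.

```python
def select_column(lines, index, sep="\t"):
    """Return the field at position `index` (0-based) of every line.

    Empty lines are kept as empty strings so that sentence
    boundaries survive the extraction.
    """
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            selected.append("")
            continue
        fields = line.split(sep)
        # Short rows yield an empty field instead of raising an error.
        selected.append(fields[index] if index < len(fields) else "")
    return selected


sample = ["The\tthe\tDT", "cat\tcat\tNN", "", "Hi\thi\tUH"]
print(select_column(sample, 1))
```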
ContaWords ContaWords is a web application that reads the words of a text file and decides what part of speech to assign to each word (credit-Noun-credit but credit-Verb-to_credit). It then counts how many times a word appears in the text in each of its forms (credits, credit, credited, etc.). It can also take into account that a sequence of words can, for example, be the name of an organization, a person or an entity (Named Entity Recognition). Besides all this, ContaWords can find sequences of two words that seem to appear together more often than would be expected, such as "monetary policy" or "fiscal balance".
Corpus to vectors Web Service This WS converts a corpus to a Weka vector arff file. The languages supported are Asturian, Catalan, English, Galician, Italian, Portuguese, Russian, Spanish, and Welsh.
EUSMT Statistical Machine Translation from Spanish to Basque. The system uses segmentation and reordering in Statistical Machine Translation from Spanish to Basque, which allows it to achieve a relative improvement of 10% in the HTER metric.
Eustagger Lemmatizer. Eustagger is a robust, wide-coverage morphological analyser and Part-of-Speech tagger for Basque. The analyser is based on the two-level formalism and has been designed incrementally with three main modules: the standard analyser, the analyser of linguistic variants, and the analyser without lexicon, which can recognize word-forms without having their lemmas in the lexicon. Using lexical transducers for the analyser has improved both the performance of the different components of the system and the description itself. It provides the possible lemmas, PoS and other morphological information for a token, and also recognizes date/time expressions and numbers. The tagger applies a combination of stochastic and rule-based disambiguation methods to Basque: the Constraint Grammar (CG) formalism and an HMM-based tagger. CG rules are applied using all the morphological features, which decreases the morphological ambiguity of texts. Finally, the stochastic tool selects just one of the remaining possible tags. Using only the stochastic method the error rate is about 14%, but the accuracy may be increased by about 2% by enriching the lexicon with the unknown words. When both methods are combined, the error rate of the whole process is 3.5%. The tool covers tokenization, morphological analysis, lemmatization and tagging for Basque. There is a web service.
Eventive nouns classifier Web Service This WS identifies eventive nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
File splitter Web Service This WS splits an input file into smaller files, each containing the number of lines given as an input parameter. The split files are stored in the public results directory, and the output is a file with the list of URLs pointing to each split file. Language independent WS.
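The splitting step itself can be sketched as chunking a list of lines. The function name is hypothetical, and the storage of the resulting files and the generation of the URL list are left out.

```python
def split_lines(lines, lines_per_file):
    """Split a list of lines into consecutive chunks of a fixed size.

    The last chunk may be shorter when the input length is not a
    multiple of `lines_per_file`.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    return [
        lines[start:start + lines_per_file]
        for start in range(0, len(lines), lines_per_file)
    ]


chunks = split_lines(["l1", "l2", "l3", "l4", "l5"], 2)
for number, chunk in enumerate(chunks, start=1):
    print(f"part {number}: {chunk}")
```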
FreeLing Chunker parser Web Service v.2.1 FreeLing-based chunker parser. The languages supported are English, Catalan, Spanish, Asturian and Galician. WARNING: This WS has a new version.
FreeLing Chunker parser Web Service v.3 This WS deploys a FreeLing-based chunker parser (v 3.0). The WS requires a plain text input. The possible output formats are FreeLing, XML, and XML CQP ready. The languages supported are English, Catalan, Spanish, Asturian and Galician.
FreeLing Dependency parser Web Service v.3 This WS deploys a FreeLing-based dependency parser (v 3.0). The WS requires a plain text input. The possible output formats are FreeLing, XML, and XML CQP ready. The languages supported are English, Catalan, Spanish, Asturian and Galician.
FreeLing Dependency parser Web Service v.2.1 FreeLing-based dependency parser. The languages supported are English, Catalan, Spanish, Asturian and Galician. WARNING: This WS has a new version.
FreeLing Morphosyntactic analyzer Web Service v.3 This Web Service deploys a FreeLing-based morphological analyzer (v 3.0). The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese.
FreeLing Morphosyntactic tagger Web Service v.3 This WS performs FreeLing-based part-of-speech tagging (v 3.0). WS job duration depends on the server load; tagging approximately 1 million words takes one minute. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, and Portuguese. The output is a tabular file with: word form, lemma, part of speech, confidence rate, initial position and final position.
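A reader for the six-column tagger output could look like the sketch below. The column order follows the description above; the whitespace separator, the class and function names, and the sample line are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaggedToken:
    form: str
    lemma: str
    pos: str
    confidence: float
    start: int
    end: int


def parse_tagger_line(line):
    """Parse one row: form, lemma, PoS, confidence, start offset, end offset.

    Assumes whitespace-separated fields with no spaces inside a field.
    """
    form, lemma, pos, confidence, start, end = line.split()
    return TaggedToken(form, lemma, pos, float(confidence), int(start), int(end))


token = parse_tagger_line("cats cat NNS 0.98 4 8")
print(token.lemma, token.pos, token.confidence)
```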
FreeLing Morphosyntactic tagger Web Service v.2.1 This WS performs FreeLing-based part-of-speech tagging. WS job duration depends on the server load; tagging approximately 1 million words takes one minute. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, and Portuguese. WARNING: This WS has a new version.
FreeLing Name Entity Recognition Web Service This Web Service deploys a FreeLing-based named entity recognizer (v 3.0). The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese.
FreeLing Sentence Splitter Web Service v.2.1 This WS performs a FreeLing-based sentence splitter. The WS splits a file in plain text format and UTF-8 encoded into units (tokens). Output sentences are separated by empty lines. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese. WARNING: This WS has a new version.
FreeLing Sentence Splitter Web Service v.3 This WS performs a FreeLing-based sentence splitter (v 3.0). The WS splits a file in plain text format and UTF-8 encoded into units (tokens) separated by new lines. Output sentences are separated by empty lines. The languages supported are English, Catalan, Spanish, Asturian, Welsh, Galician, Italian, Russian and Portuguese.
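Output in this shape (one token per line, an empty line between sentences) is easy to read back into a list of sentences. A minimal sketch with a hypothetical function name:

```python
def read_sentences(text):
    """Group one-token-per-line text into sentences.

    An empty line closes the current sentence; repeated empty
    lines do not create empty sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


print(read_sentences("Hello\nworld\n.\n\nBye\n.\n"))
```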
FreeLing Tokenizer Web Service v.2.1 This WS deploys a FreeLing-based text tokenizer. The WS splits a file in plain text format and UTF-8 encoded into units (tokens). The languages supported are Catalan, English, Galician, Italian, Portuguese, Russian, Spanish, Welsh, and Asturian. WARNING: This WS has a new version.
FreeLing Tokenizer Web Service v.3 This WS deploys a FreeLing-based text tokenizer (v 3.0). The WS splits a file in plain text format and UTF-8 encoded into units (tokens) where tokens are separated by new lines. The languages supported are Catalan, English, Galician, Italian, Portuguese, Russian, Spanish, Welsh, and Asturian.
GrAF from dependency converter Web Service This WS is a Panacea project converter that creates GrAF elements from dependency parser output.
GrAF skeleton from basic XCES converter This WS is a Panacea project converter that creates GrAF skeleton from BASIC XCES documents.
Gretel 2.0 GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics. It is a user-friendly search engine for the exploitation of treebanks. It comes in two formats: a) Example-based search: in this search mode you can use a natural language example as a starting point for searching a treebank (a text corpus with syntactic annotations) with limited knowledge about tree representations and formal query languages (the formal (XPath) query is automatically generated). b) XPath search: in this search mode you have to build the XPath query yourself (for experienced XPath users).
HTML to text converter Web Service A WS to convert HTML documents to plain text format. Language independent WS.
Human nouns classifier Web Service This WS identifies human nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
Hungalign to GrAF converter Web Service This WS creates an alignment file combining the Hunalign output and two sentence ID lists extracted from GrAF documents.
IULA GrAF tagger Web Service This WS converts the results of the IULA tagger (PoS tagger) into GrAF output.
IULA TreeTagger Web Service This WS is a morphosyntactic tagger. The disambiguation process is done by a TreeTagger instance trained by the IULA. The input is plain text in Catalan or Spanish. The output allows optional formats and optional encoding. (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
IULA character encoding converter Web Service This WS converts the character encoding of given files from one encoding to another. Based on the Linux 'iconv' command used to convert between different character encodings.
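The same kind of conversion that `iconv -f FROM -t TO` performs on the command line can be sketched in Python by decoding with the source encoding and re-encoding with the target one. The function name is hypothetical and the service's own options are not documented here.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode raw bytes from `source` encoding to `target` encoding.

    Raises UnicodeDecodeError if `data` is not valid in `source`, and
    UnicodeEncodeError if a character cannot be represented in `target`.
    """
    return data.decode(source).encode(target)


latin1_bytes = "año".encode("latin-1")
utf8_bytes = convert_encoding(latin1_bytes, "latin-1", "utf-8")
print(latin1_bytes, utf8_bytes)
```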
IULA character encoding converter Web Service Convert character encoding of given files from one encoding to another. Based on the Linux 'iconv' command that converts text from one encoding to another encoding.
IULA concordancer Web Service Given a lemma and a category, this WS returns the sentences of the IULA corpus where this lemma occurs. The user can perform a domain search. The languages supported are Spanish and English.
IULA lexicon look up Web Service Given a word form, this WS returns the lexical information by looking it up in the IULA's lexicon. The languages supported are Catalan, Spanish or English.
IULA paradigma Web Service Given a verb (an infinitive or another verbal form), this WS outputs its verbal paradigm grouped according to tense and mood. The languages supported are Catalan and Spanish.
IULA preprocess Web Service This WS provides a text segmentation into minor structural units (titles, paragraphs, sentences, etc.); detection of entities (not found in a dictionary: numbers, abbreviations, URLs, emails, etc.); and the keeping of sequences of two or more words in a single block (dates, phrases, etc.). The input is plain text in Catalan and Spanish.
IULA tokenizer Web Service The IULA tokenizer WS splits a file in plain text format and UTF-8 encoded into units (tokens). The languages supported are Catalan and Spanish.
IxaPipes A modular set of Natural Language Processing tools for English and Spanish. IXA pipes is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology for English and Spanish. It offers robust and efficient linguistic annotation to both researchers and non-NLP experts, with the aim of lowering the barriers to using NLP technology, whether for research purposes or for small industrial developers and SMEs. The ixa-pipes tools can be used as they are, or their modularity can be exploited to pick and change different components. These are the different components: - ixa-pipe-tok: Tokenizer and Segmenter for several languages. - ixa-pipe-pos: POS tagger for Spanish and English. - ixa-pipe-nerc: Named Entity Recognition tagger for Spanish and English. - ixa-pipe-parse: Probabilistic constituent parser for Spanish and English. - ixa-pipe-coref: Coreference resolution tool. Soon to be available!! Every ixa pipe tool can be up and running after two simple steps. The tools require Java 1.7+ to run and are designed to come with all batteries included, which means that no system configuration or installation of third-party dependencies is required. The modules will run on any platform as long as a JVM 1.7+ is available. IXA pipes are just a set of processes chained by their standard streams, so that the output of each process feeds directly as input to the next one. The Unix pipes metaphor has been applied to NLP tools by adopting a very simple and well-known data-centric architecture, in which every module/pipe is interchangeable with any other tool as long as it reads and writes the required data format via the standard streams. The data format used to represent and pipe linguistic annotations, for both the input and the output of the modules, is NAF. Our Java modules all use the kaflib library for easy NAF integration.
Ixati Chunking for Basque. There is a web service. Zatiak performs shallow syntactic analysis of a sentence. This program reads an input text and, after morphological processing, identifies pieces of text (chunks). Each chunk is marked with its type: nominal phrase (NP or PP) or verb chain, together with its associated information: grammatical case, number, definiteness and syntactic functions, among others.
Keeleveeb query Keeleveeb is a portal, where one can run queries on several dictionaries and corpora. There are 12 Estonian monolingual dictionaries, 12 bilingual dictionaries (one of them Estonian), 19 Specialty dictionaries, 15 Learner dictionaries (bilingual, Estonian-Russian-Estonian), 23 corpora, and 3 tools integrated into the queries.
LMF file merger Web Service Given two LMF files, this webservice merges them into a single LMF file. It works for LMF files encoding the information in the same way, i.e. same labels, values and structure. This will work, for example, for merging different lexica learnt under the PANACEA platform. If the LMF files contain equivalent information encoded in different ways, a mapping into a common format should be performed first. WARNING: this version of the webservice only works for LMF files without references, that is, LMF files containing only “<LexicalEntry>” elements. It works for morphological dictionaries, noun classification, etc., but not for SCF lexica, for example. SCF lexica first define the frames associated with each verb and then the frames themselves. The link between the two is made using the IDs. This version of the webservice is not able to follow such links, so it won’t merge elements outside “<LexicalEntry>” correctly. This will be developed in the future.
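For files that contain only lexical entries, a naive merge can be sketched with the Python standard library. The element layout assumed here (a "LexicalResource" root with one "Lexicon" child holding "LexicalEntry" elements) and the function name are assumptions of this example; the sketch does not deduplicate entries and, like the service, does not follow ID references.

```python
import xml.etree.ElementTree as ET


def merge_lmf(xml_a, xml_b):
    """Append every LexicalEntry of the second document to the first Lexicon."""
    root_a = ET.fromstring(xml_a)
    root_b = ET.fromstring(xml_b)
    lexicon_a = root_a.find("Lexicon")
    if lexicon_a is None:
        raise ValueError("first document has no Lexicon element")
    for entry in root_b.iter("LexicalEntry"):
        lexicon_a.append(entry)
    return ET.tostring(root_a, encoding="unicode")


doc_a = '<LexicalResource><Lexicon><LexicalEntry id="a"/></Lexicon></LexicalResource>'
doc_b = '<LexicalResource><Lexicon><LexicalEntry id="b"/></Lexicon></LexicalResource>'
merged = merge_lmf(doc_a, doc_b)
print(merged)
```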
LMF merger Web Service Given a list of URLs pointing to LMF files, this webservice merges them into a single LMF file. It works for LMF files encoding the information in the same way, i.e. same labels, values and structure. This will work, for example, for merging different lexica learnt under the PANACEA platform. If the LMF files contain equivalent information encoded in different ways, a mapping into a common format should be performed first. This webservice is a generalization of merge_lmf_files (http://services.iula.upf.edu/services/245), which merges only two files. See the documentation in merge_lmf_files for details. WARNING: this version of the webservice only works for LMF files without references, that is, LMF files containing only “<LexicalEntry>” elements. It works for morphological dictionaries, noun classification, etc., but not for SCF lexica, for example. SCF lexica first define the frames associated with each verb and then the frames themselves. The link between the two is made using the IDs. This version of the webservice is not able to follow such links, so it won’t merge elements outside “<LexicalEntry>” correctly. This will be developed in the future.
Lexical classifier Web Service Given a set of signatures in a weka file (test_file.arff), classify them using the parameters estimated for each cue (theta_file.csv).
Linescrambler Web Service This WS scrambles the lines in a file. The goal is to make it difficult to reproduce the original text. The input size limit is 100 MB. Language independent WS.
Linescrambler parallel Web Service This WS will scramble the lines in a parallel text corpus keeping the alignment. The goal is to make it difficult to reproduce the original text. The input size limit is 100 MB. Language independent WS.
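Scrambling a parallel corpus without losing the alignment comes down to shuffling line pairs rather than each side separately. A minimal sketch; the function name and the optional seed are assumptions of this example.

```python
import random


def scramble_parallel(source_lines, target_lines, seed=None):
    """Shuffle two aligned lists of lines with the same permutation."""
    if len(source_lines) != len(target_lines):
        raise ValueError("both sides must have the same number of lines")
    pairs = list(zip(source_lines, target_lines))
    random.Random(seed).shuffle(pairs)  # one shuffle keeps pairs together
    scrambled_source = [source for source, _ in pairs]
    scrambled_target = [target for _, target in pairs]
    return scrambled_source, scrambled_target


source = [f"s{i}" for i in range(10)]
target = [f"t{i}" for i in range(10)]
new_source, new_target = scramble_parallel(source, target, seed=1)
print(list(zip(new_source, new_target))[:3])
```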
Location nouns classifier Web Service This WS identifies location nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
MS Word to text converter Web Service A WS to convert MS Word documents to plain text format. Language independent WS.
MaltParser Web Service This WS calls an instance of MaltParser for Spanish trained with the IULA treebank developed in the Metanet4you project. The input of this WS is plain text. The service performs PoS tagging with FreeLing and then performs the dependency parsing using Malt parser. The output follows CoNLL format.
Maltixa Statistic-based dependency parser. Given a set of sentences in Basque, one sentence per line, it produces a dependency analysis of the sentences in a format equivalent to the CoNLL format (although not identical, as the columns appear in a different order).
Matter nouns classifier Web Service This WS identifies matter nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
Matxin Machine translation from Spanish to Basque. Matxin is a transfer-based MT system from Spanish into Basque. It is an open, reusable and interoperable framework which can be improved in the near future by combining it with the statistical model. The MT architecture reuses several open tools and is based on a single XML format for the flow between the different modules, which makes interaction among different developers of tools and resources easier. Since Basque is a resource-poor language, this is a key feature for future improvements and extensions of the engine. The result is open source software which can be downloaded from matxin.sourceforge.net. We think it could be adapted to translating between other languages with few resources.
Migmap Migmap is a web application where the user first chooses generation (forward or backward in time) and gender, while the migration map of The Netherlands related to an interactively pointed municipality (or other aggregation unit) is shown. The existing map-making software module "Kaart" of the Meertens Institute has been transformed into a generic, standards-based tool for the creation and presentation of maps with complex spatio-temporal diffusion data in a user-friendly and interactive way. It is based on the availability of places of birth and residence (in 2006) of the Dutch population (16 million alive, 6 million deceased but included) and their family relations from the Civil Registration, so that migration patterns between municipalities (and immigration from abroad) can be presented over three generations in the 20th century.
Mimore It is a web application that enables simultaneous search in three micro-comparative databases on Dutch dialects via a common interface. This makes it possible to investigate potential correlations between variables at the three different linguistic levels. Cartographic functionality enables the user to visualize these correlations and set theoretic functionality to analyze them.
Morfeus Morphological analyzer.
Naive Bayes classifier Web Service This webservice performs traditional Naive Bayes classification of instances given in a weka file. It outputs the predicted classification for each instance and some statistics about the performance of the classification. The parameters needed as input can be learnt using estimate_bayesian_parameters webservice.
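The decision rule of a Naive Bayes classifier over cue counts can be sketched as a log-odds score. This is a generic multinomial formulation under stated assumptions, not the service's implementation: the function names, the prior and the toy parameters are hypothetical, and all probabilities are assumed to be strictly between 0 and 1 (for example after Bayesian estimation or smoothing).

```python
import math


def naive_bayes_score(cue_counts, p_class, p_no_class, prior_class=0.5):
    """Log-odds of class membership given per-cue counts.

    cue_counts: observed count of each cue for the instance
    p_class:    P(cue|class) for each cue
    p_no_class: P(cue|no class) for each cue
    """
    score = math.log(prior_class) - math.log(1.0 - prior_class)
    for count, p1, p0 in zip(cue_counts, p_class, p_no_class):
        score += count * (math.log(p1) - math.log(p0))
    return score


def classify(cue_counts, p_class, p_no_class, prior_class=0.5):
    """Return 1 (class member) when the log-odds are positive, else 0."""
    return 1 if naive_bayes_score(cue_counts, p_class, p_no_class, prior_class) > 0 else 0


theta_class = [0.2, 0.1]
theta_no_class = [0.1, 0.2]
print(classify([2, 0], theta_class, theta_no_class))
print(classify([0, 3], theta_class, theta_no_class))
```

Working in log space avoids numeric underflow when an instance has many cue occurrences.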
P clue/ lexical class calculator Web Service This WS calculates the probability of seeing a linguistic cue given a lexical class (P(cue|class) value). This probability is computed given the occurrences of cues in a corpus (codified in the signatures file) and the information of belonging or not belonging of these words to different classes (codified in indicators file). The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
P clue/ lexical class computer Web Service This WS calculates the probability of seeing a linguistic cue given a lexical class (P(cue|class) value). This probability is computed given the occurrences of cues in a corpus (codified in the signatures file) and the information of belonging or not belonging of these words to different classes (codified in indicators file). The probability is computed for each studied cue in the signatures file and for each class in the indicators file.
PANACEA converter Web Service This is the Panacea conversion tool.
PDF to text converter Web Service This WS converts PDF documents to plain text format. Language independent WS.
PML-TQ search engine and interface PML-TQ is a powerful open-source search tool for all kinds of linguistically annotated treebanks with several client interfaces and two search backends (one based on a SQL database and one based on Perl and the TrEd toolkit). The tool works natively with treebanks encoded in the PML data format (conversion scripts are available for many established treebank formats).
PoS tagger to Xces converter Web Service A WS to convert PoS Tagger formats to XCES.
Post tagging to GrAF converter Web Service This WS is a Panacea project converter that creates GrAF documents from the output of PoS taggers (Freeling and IULA tagger).
Process nouns classifier Web Service This WS identifies process nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for membership of the class (greater than 0 means class member; smaller means non-member of the class) - filtered: the nouns are filtered according to their score. If the score is positive and over a set threshold, the noun is considered to be a member of the class. If it is negative and under another threshold, it is considered to be a non-member of the class. The other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set according to some experiments; if you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The language supported is Spanish.
Provenance collector Web Service This WS collects all the headers of input XML files used in a Taverna workflow. The metadata that can be stored in the resulting XML file are: 1) workflow name, 2) workflow myExperiment link, 3) processors list, and 4) list of XML headers.
Search signatures Web Service Given a list of lemmas, the WS looks for their occurrences in the IULA corpus, applies the given regular expressions and returns all the signatures.
Select Nouns from LMF lexicon Web Service Given an LMF file with nouns classified with a score (see Nouns classifier Web Services), this WS selects the nouns with confidence over a desired threshold. Language independent WS.
Semiotic nouns classifier Web Service This WS identifies semiotic nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for class membership (a score above 0 means class member; a score below 0 means non-member) - filtered: the nouns are filtered according to their score. If the score is positive and above a given threshold, the noun is considered a member of the class. If it is negative and below another threshold, it is considered a non-member of the class. All other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set on the basis of experiments. If you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The language supported is Spanish.
Soaplab validator Web Service This WS verifies that a Soaplab web service is Panacea compliant.
Social nouns classifier Web Service This WS identifies social nouns in a part-of-speech tagged text (tagged with the FreeLing Morphosyntactic tagger V 3.0 WS). The classification is performed with a pre-trained Decision Tree. The output is an LMF file with the classifier prediction for each noun. You can choose to have this prediction as: - scored: each noun gets a score for class membership (a score above 0 means class member; a score below 0 means non-member) - filtered: the nouns are filtered according to their score. If the score is positive and above a given threshold, the noun is considered a member of the class. If it is negative and below another threshold, it is considered a non-member of the class. All other cases are tagged as unknown, since the classifier was not confident enough in their classification. The thresholds are pre-set on the basis of experiments. If you want to use your own thresholds, get the scored output and use the filter web service to filter it with your thresholds. The languages supported are Spanish and English.
Stream editor Web Service (sed) This WS performs basic text transformations on an input text. The service is based on the 'sed' program, a Unix utility that parses and transforms text, using a simple, compact programming language.
TF-IDF calculator Web Service This WS calculates the Term Frequency (TF) and the Inverse Document Frequency (IDF) of a word in a given corpus. Combined as TF-IDF, the two values form a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
TGZ file compressor Web Service This WS creates a compressed file (in TGZ format) from output documents stored on this same server, identified by their URLs.
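For readers unfamiliar with the measure, here is a small sketch of one common TF-IDF formulation (relative term frequency multiplied by the natural logarithm of the inverse document frequency). Which weighting variant the service actually applies is not stated, so the formula and the function names below are assumptions.

```python
import math


def tf(word, document):
    """Relative frequency of `word` in a tokenized document (list of tokens)."""
    return document.count(word) / len(document) if document else 0.0


def idf(word, corpus):
    """Log of (number of documents / number of documents containing `word`)."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(word, document, corpus):
    """TF-IDF weight of `word` for one document of the corpus."""
    return tf(word, document) * idf(word, corpus)
```

A word that occurs in every document of the corpus gets an IDF of 0 under this formulation, and therefore a TF-IDF of 0 regardless of how often it appears.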
TMX shuffling Web Service This WS randomizes the order of the translation units in TMX files. The goal is to make it difficult to reproduce the original text. The input size limit is 100 MB. Language independent WS.
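The idea of randomizing translation units can be sketched with the Python standard library. This is an illustrative approximation, not the service's implementation: the function name and the seed parameter are hypothetical, and the sketch assumes a TMX document whose tu elements are direct children of body.

```python
import random
import xml.etree.ElementTree as ET


def shuffle_tmx(tmx_text, seed=None):
    """Return the TMX document with its <tu> elements in random order."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    if body is None:
        raise ValueError("no <body> element found in the TMX document")
    units = list(body.findall("tu"))
    # A seed makes the shuffle reproducible; leave it as None for a fresh order.
    random.Random(seed).shuffle(units)
    for tu in body.findall("tu"):
        body.remove(tu)
    for tu in units:
        body.append(tu)
    return ET.tostring(root, encoding="unicode")
```

Only the order of the units changes; the content of each translation unit is left untouched.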
TRL Malt Parser module for Spanish The file espmalt-1.0.mco contains a single malt configuration for parsing Spanish text with MaltParser. The parser presupposes that the input is in CoNLL-X format and tagged with the part-of-speech tags of FreeLing tagger.
Ted Pedersen's Ngram Statistics Package Ted Pedersen's Ngram Statistics Package is used to identify word Ngrams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc.
Ted Pedersen's Ngrams Counter Web Service This WS performs the Count function from Ted Pedersen's Ngram Statistics Package (used to identify word Ngrams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc.). Language independent WS.
Ted Pedersen's Text Similarity Web Service This WS is based on Ted Pedersen's Text Similarity module. It measures the similarity of two documents based on the number of shared words scaled by the lengths of the files. Text Similarity WS computes the F-Measure, the Dice Coefficient, the Cosine, and the Lesk measure. Language independent WS.
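To make "shared words scaled by the lengths of the files" concrete, the sketch below computes two of the listed measures from a word-overlap count. The formulas are a common overlap-based formulation and are assumed here; they are not taken from the Text Similarity module itself, and the function name is hypothetical.

```python
import math
from collections import Counter


def overlap_measures(text_a, text_b):
    """Dice and cosine scores based on the number of shared words."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a or not words_b:
        return {"dice": 0.0, "cosine": 0.0}
    # Multiset intersection: a word repeated in both texts counts each time.
    shared = sum((Counter(words_a) & Counter(words_b)).values())
    return {
        "dice": 2 * shared / (len(words_a) + len(words_b)),
        "cosine": shared / math.sqrt(len(words_a) * len(words_b)),
    }
```

Both scores are 0 when the texts share no words and 1 when they contain exactly the same words with the same frequencies.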
Textual emigration analysis Historians, literary scientists, and others are interested in the semantic interpretation of text. With automatic pre-processing of texts, e.g. named entity recognition, coreference resolution, and dependency parsing, relevant semantic relations can be extracted. The Stuttgart tools extract relations of migration, i.e. name and job of the migrant and date, reason and destination of migration. A graphical interface showing timeline and geographic migrations will be implemented.
The Glossa corpus search system New version of the corpus search and post-processing tool Glossa. While the old version was tightly coupled to the IMS Corpus Workbench (CWB) and could only search in CWB-encoded corpora, the new version is flexible with respect to search engines and can even search in corpora located on other servers using the CLARIN federated content search. Incoming searches from other search agents in the CLARIN network are also supported (although AAI support is not yet implemented).
Twitter NLP Web Service This WS is based on the Twitter NLP tool developed by Noah's ARK group (Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University). The tool provides a fast and robust Java-based tokenizer and part-of-speech tagger for Twitter, its training data of manually labeled POS-annotated tweets, a web-based annotation tool, and hierarchical word clusters from unlabeled tweets. The language supported is English.
Vocabulary analyzer Web Service This WS calculates different lexicometric measures and displays them graphically (tokens, types, hapaxes and type/token ratio). The input is a plain text corpus with one token per line. Language independent WS.
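The measures listed for the vocabulary analyzer can be computed as in the following sketch, which reads a corpus with one token per line. The function name and the returned dictionary keys are illustrative choices; the graphical display is not covered.

```python
from collections import Counter


def lexicometrics(corpus_text):
    """Count tokens, types and hapaxes in a one-token-per-line corpus."""
    tokens = [line.strip() for line in corpus_text.splitlines() if line.strip()]
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    # Hapaxes are types that occur exactly once.
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": hapaxes,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
    }
```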
WSD-IXA Word-Sense Disambiguation. The WSD system is based on the well-known Support Vector Machine (SVM) algorithm. The system has been trained on the EuSemCor corpus (the only semantically tagged Basque corpus). Due to the corpus's reduced size, the WSD system has been trained for 402 polysemous nouns. A Perl CGI script runs the raw input text through the Eustagger Basque lemmatizer in order to extract features. Then, the feature vector is classified by the WSD (SVM) system. Finally, the CGI script formats the classifier and lemmatizer output for display.
Weka noun signatures creator Web Service This web service creates a weka file containing context information of a list of nouns in a given corpus. The context information for each noun is extracted using a set of Regular Expressions and it is encoded in one vector (one line per noun in the weka file). Each slot in the vector represents the number of times the regular expression in this position has been observed with the given noun. Inputs: - corpusId: Already indexed CQP corpus ID from which to extract the signatures. You can index your PoS tagged corpus using cqp_index web service. - regularExpressions: List of Regular Expressions to be applied, separated by line breaks. The order of the REs in this file will be the order in the weka vectors. Optional parameters: - className: Name of the class to be included in the weka file. - indicators: Indicators file stating whether each noun belongs to the studied class. Format: one word per line with binary values (belonging/not belonging to the class) separated by tabs. In UTF-8. - lemmas: If the information about belonging or not belonging to the class (indicators) is not available, you may want to include a list of nouns to be processed. The format is a list of lemmata separated by line breaks, in UTF-8. If this and the indicators field are empty, all nouns in the corpus will be processed (this may take a long time). - minOccurrences: minimum number of times a noun has to be seen in the corpus to be included in the output file. If a list of lemmas is given, by default minOccurrences is set to 1. - vector_type: type of vector desired at the output. Outputs: - weka: weka file with noun vectors found in the given corpus. - notFoundLemmas: list of lemmas that did not appear in the corpus often enough to meet the minOccurrences threshold. - concordances: sentences in the corpus in which the selected nouns appear and information about which Regular Expressions matched in each sentence. Useful for developing and testing the REs.
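The vector construction described above (one slot per regular expression, holding the number of matches observed for a noun) can be approximated as follows. The real service queries a CQP-indexed, PoS-tagged corpus; this sketch instead assumes plain-text context sentences already grouped by noun and uses Python regular expressions, so the input structure and the function name are hypothetical.

```python
import re


def noun_signatures(contexts, regular_expressions):
    """Build one count vector per noun.

    contexts: {noun: [sentence, ...]}, a hypothetical stand-in for corpus lookups.
    regular_expressions: list of patterns; their order fixes the slot order.
    """
    compiled = [re.compile(pattern) for pattern in regular_expressions]
    return {
        noun: [sum(len(rx.findall(s)) for s in sentences) for rx in compiled]
        for noun, sentences in contexts.items()
    }
```

Each resulting vector could then be written as one data line of a weka file, with the slots in the same order as the list of regular expressions.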
XML/TXT to Weka converter Web Service Given an XML signatures file (signatures.xml) and the indicators file (indicators.txt) with the nouns that belong or not to the class, this WS creates a file in ARFF format to experiment with Weka. Warning: the default encoding for input and output files is ISO-8859-1. It may be changed using optional parameters, but the two input files must have the same encoding, which must be indicated in the headers of the XML file.
XSLT applicator Web Service A command line tool for applying XSLT stylesheets to XML documents.
Diccionario Básico Escolar Students' basic dictionary (Cuba). The GUI of the Diccionario Básico Escolar allows, besides common dictionary lookup, detecting the most common misspellings, consulting verb conjugation, syllabification of the headwords and, in some cases, viewing illustrations attached to the entries. This electronic dictionary has been developed with the Centro de Lingüística Aplicada institute in Santiago de Cuba. There are two versions, one on CD (which has been distributed in Cuban schools) and the web version. This project has been funded by the Basque Government (in the FOCAD call).
BASYQUE BASYQUE (Base de Données Syntaxique Basque) is the web application we have developed to store, organize, manage and search for all the information concerning dialectal variation in Basque speaking areas, and specifically, in the North-Eastern Basque dialects. In order to collect and analyze data, we define some specific questionnaires. Each questionnaire tackles a linguistic phenomenon that undergoes syntactic variation. The questionnaires are administered to informants of different age ranges, selected from several locations of the Northern Basque Country. We record their answers and collect the recorded examples as well as their corresponding information in a database. All these data, which can be consulted in BASYQUE, constitute the main source of information of this project.
BertsolariXa Application that finds words ending in the character sequence given by the user. BertsolarIXA is able to find not only lemmas but also inflected forms. Results can be filtered by domain, and phonetic rules can also be applied. It is a tool aimed at helping verse-makers.
Eihera Eihera is a system for Named Entity recognition and classification in written Basque. The system is designed in four steps: first, the development of a recognizer based on linguistic information represented on finite-state-transducers; second, the generation of semi-automatically annotated corpora from the result of these transducers; third, the achievement of the best possible recognizer by training different ML techniques on these corpora; and finally, the combination of the different recognizers obtained. Eihera classifies the named entities into three classes: person, organization and location.
Ihardetsi A Question-Answering system for the area of Science and Technology. Ihardetsi is a question answering system for Basque. It is a general platform whose architecture pays special attention to: 1) the integration of the development and evaluation environments, and 2) the systematic use of XML declarative files to control the execution of the modules and the communication between them.
Xuxen Spelling corrector on-line. Xuxen is a spelling corrector for Basque integrated in MS-Office, OpenOffice, Firefox, OCR programs and others. It can be downloaded from the Basque Government's website (> 25,000 downloads). Eleka is the company which manages it now. The fact that Basque is a highly inflected language makes the correction of spelling errors extremely difficult, because collecting all possible word-forms in a lexicon is an endless task. The simplicity of English inflection has meant reduced interest in research on morphological analysis by computer: in English, the most common practice is to use a lexicon of all of the inflected forms or a minimum set of morphological rules. Within this context we have implemented XUXEN, our spelling checker-corrector (Aduriz et al., 1997). It completely covers the standard language defined by the Academy of the Basque Language. XUXEN manages user-lexicons that can be interactively enriched during correction by means of a specially designed human-machine dialogue. It allows the system to acquire the internal features of each new entry (sublexicon, continuation class, and selection marks). Due to the late standardization of the language, writers don't always know the standard form to be used and commit errors. These 'typical errors' are treated in a specific way, by describing them using the two-level lexicon system. In this sense, XUXEN is intended as a useful tool for standardization purposes of present day written Basque.