
Corpora by Funding project

3LB: Building A Syntactic Semantic Trees Based Database

AnCora-ES

AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...
AnCora-CA

AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...

CESS-ECE Syntactically And Semantically Annotated Corpora (Spanish, Catalan, Basque)

AnCora-DEP-CA

AnCora-DEP-CA is the AnCora-CA multilevel annotated corpus of Catalan in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-CA can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al...
AnCora-CA

AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...
AnCora-ES

AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...
GrAF Version Of The Basque Dependency Treebank

This is the stand-off GrAF version of the Basque Dependency Treebank (BDT). It is the Reference Corpus for the Processing of Basque (EPEC) annotated at the syntactic level. EPEC is a 300,000-word corpus of standard written journal texts which aims to be a training corpus for the development...
AnCora-DEP-ES

AnCora-DEP-ES is the AnCora-ES multilevel annotated corpus of Spanish in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-ES can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al...

Know2: Language Understanding Technologies For Multilingual Domain Oriented Information Access

GrAF Version Of Catalan Portions Of Wikipedia Corpus

This is the stand-off GrAF version of the Catalan portions of Wikipedia (based on a 2006 dump). This Wikipedia Catalan Corpus contains 122,052 articles comprising about 47.3 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags an...
GrAF Version Of Spanish Portions Of Wikipedia Corpus

This is the stand-off GrAF version of the Spanish portions of Wikipedia (based on a 2006 dump). This Wikipedia Spanish Corpus contains 257,019 articles comprising about 150.1 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags a...

Lang2World: Discovering The World Knowledge Codified In The Language

AnCora-DEP-ES

AnCora-DEP-ES is the AnCora-ES multilevel annotated corpus of Spanish in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-ES can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al...
AnCora-CA

AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...
AnCora-DEP-CA

AnCora-DEP-CA is the AnCora-CA multilevel annotated corpus of Catalan in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-CA can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al...
AnCora-ES

AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...

METANET4U – Enhancing The European Linguistic Infrastructure

Corpus92 Corpus

The corpus consists of a number of texts corresponding to Access to University examinations held in June 1992 at several Spanish universities. It contains about 350,000 words in 3 documents.
Genomics IULA Corpus In Catalan

The corpus consists of a number of specialized texts in the Genome domain. This LSP corpus has been created from articles in specialized publications, PhD theses, etc. It contains about 950,000 words in 134 documents.
IULA Spanish LSP Treebank

This treebank consists of a number of syntactically analyzed sentences. The sentences have been chosen from the IULA LSP corpus, automatically annotated with POS information and manually annotated with syntactic information using the DELPH-IN environment. The resulting syntactic anal...
IULA Spanish English Technical Corpus

The corpus consists of a number of specialized texts (Law, Economics, Medicine, Environment and Computer Science domains) available in both Spanish and English. This LSP corpus has been compiled from articles in specialized publications, PhD theses, etc. It contains about a ...
GrAF Version Of The Basque Dependency Treebank

This is the stand-off GrAF version of the Basque Dependency Treebank (BDT). It is the Reference Corpus for the Processing of Basque (EPEC) annotated at the syntactic level. EPEC is a 300,000-word corpus of standard written journal texts which aims to be a training corpus for the development...
GrAF Version Of The SenSem Spanish Corpus

This is the stand-off GrAF version of the SenSem Spanish Corpus. The original SenSem Spanish Corpus includes syntactic and semantic annotations for a number of Spanish texts from the press domain developed by the GRIAL group (Grup de recerca consolidat de la Generalitat de Catalunya). T...
GrAF Version Of Spanish Portions Of Wikipedia Corpus

This is the stand-off GrAF version of the Spanish portions of Wikipedia (based on a 2006 dump). This Wikipedia Spanish Corpus contains 257,019 articles comprising about 150.1 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags a...
Tibidabo Treebank And IULA Spanish LSP Treebank Train And Test Partitions

This package contains a partition of the IULA Spanish LSP Treebank into train and test sets for Machine Learning experiments. This way, different researchers can use the same partitions and compare their results directly. In this package we also deliver the Tibid...
Genomics IULA Corpus In Spanish

The corpus consists of a number of specialized texts in the Genome domain. This LSP corpus has been created from articles in specialized publications, PhD theses, etc. It contains about 1,650,000 words in 276 documents.
GrAF Version Of Catalan Portions Of Wikipedia Corpus

This is the stand-off GrAF version of the Catalan portions of Wikipedia (based on a 2006 dump). This Wikipedia Catalan Corpus contains 122,052 articles comprising about 47.3 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags an...
IULA Penn Treebank

This treebank consists of a number of Spanish and English sentences that have been manually annotated with syntactic information. The sentences have been chosen from the Penn TreeBank corpus, a resource containing texts from the Wall Street Journal and originally compiled by the Universit...

PANACEA

PANACEA Annotated Dependency Spanish Environment Corpus Version 2

PANACEA Annotated Spanish Environment Corpus Version 2 consists of Spanish texts in the Environment (ENV) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The texts were ...
PANACEA Environment Corpus N-Grams EN (English)

This data set contains English word n-grams and English word/tag/lemma n-grams in the Environment (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context...
PANACEA Annotated Dependency Greek Labour Legislation Corpus Version 2

PANACEA Annotated Greek Labour Legislation Corpus Version 2 consists of Greek texts in the Labour Legislation (LAB) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The t...
PANACEA Environment Corpus N-Grams FR (French)

This data set contains French word n-grams and French word/tag/lemma n-grams in the Environment (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context o...
PANACEA Annotated Dependency Spanish Labour Legislation Corpus Version 2

PANACEA Annotated Spanish Labour Legislation Corpus Version 2 consists of Spanish texts in the Labour Legislation (LAB) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. T...
PANACEA Labour Legislation Corpus N-Grams ES (Spanish)

This data set contains Spanish word n-grams and Spanish word/tag/lemma n-grams in the Labour Legislation (LAB) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the ...
PANACEA Environment Corpus N-Grams IT (Italian)

This data set contains Italian word n-grams and Italian word/tag/lemma n-grams in the Environment (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context...
PANACEA Labour Legislation Corpus N-Grams IT (Italian)

This data set contains Italian word n-grams and Italian word/tag/lemma n-grams in the Labour Legislation (LAB) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context of P...
PANACEA Labour Legislation Corpus N-Grams EN (English)

This data set contains English word n-grams and English word/tag/lemma n-grams in the Labour Legislation (LAB) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the ...
PANACEA Annotated Dependency Italian Labour Legislation Corpus Version 2

PANACEA Annotated Italian Labour Legislation Corpus Version 2 consists of Italian texts in the Labour Legislation (LAB) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. T...
PANACEA Labour Legislation Corpus N-Grams FR (French)

This data set contains French word n-grams and French word/tag/lemma n-grams in the Labour Legislation (LAB) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context of PAN...
PANACEA Environment Corpus N-Grams ES (Spanish)

This data set contains Spanish word n-grams and Spanish word/tag/lemma n-grams in the Environment (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context...
PANACEA Annotated Dependency Greek Environment Corpus Version 2

PANACEA Annotated Greek Environment Corpus Version 2 consists of Greek texts in the Environment (ENV) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The texts were craw...
PANACEA Annotated Dependency Italian Environment Corpus Version 2

PANACEA Annotated Italian Environment Corpus Version 2 consists of Italian texts in the Environment (ENV) domain that were collected and automatically annotated in the framework of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The texts were ...

Praxem, Semantic And Pragmatic Annotation Of The CESS-ECE Corpus

AnCora-CA

AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...
AnCora-ES

AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C...

Sentence Semantics: Creación De Una Base De Datos De Semántica Oracional

GrAF Version Of The SenSem Spanish Corpus

This is the stand-off GrAF version of the SenSem Spanish Corpus. The original SenSem Spanish Corpus includes syntactic and semantic annotations for a number of Spanish texts from the press domain developed by the GRIAL group (Grup de recerca consolidat de la Generalitat de Catalunya). T...