Corpora

A corpus is a collection of writings, conversations, speeches, etc., that people use to study and describe a language [Merriam-Webster].

Corpora by Annotation Type

Morphosyntactic Annotation: POS Tagging

  • AnCora-CA

    The AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more

  • AnCora-ES

    The AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more

  • Genomics IULA Corpus In Spanish

    The corpus consists of specialized texts from the genomics domain. This LSP corpus was created from articles in specialized publications, PhD theses, etc. It contains about 1,650,000 words in 276 documents.

  • AnCora-DEP-Es

    AnCora-DEP-Es is the AnCora-Es multilevel annotated corpus of Spanish in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-Es can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al... more

  • Corpus92 Corpus

    The corpus consists of a number of texts from university entrance examinations held in June 1992 at several Spanish universities. It contains about 350,000 words in 3 documents.

  • IULA Spanish English Technical Corpus

    The corpus consists of a number of specialized texts (Law, Economics, Medicine, Environment and Computer Science domains) available in both Spanish and English. This LSP corpus has been compiled from articles in specialized publications, PhD theses, etc. It contains about a ... more

  • GrAF Version Of Spanish Portions Of Wikipedia Corpus

    This is the stand-off GrAF version of the Spanish portions of Wikipedia (based on a 2006 dump). This Wikipedia Spanish Corpus contains 257,019 articles comprising about 150.1 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags a... more

  • AnCora-DEP-CA

    AnCora-DEP-CA is the AnCora-CA multilevel annotated corpus of Catalan in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-CA can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al... more

  • GrAF Version Of Catalan Portions Of Wikipedia Corpus

    This is the stand-off GrAF version of the Catalan portions of Wikipedia (based on a 2006 dump). This Wikipedia Catalan Corpus contains 122,052 articles comprising about 47.3 million words in raw text format. It has been cleaned by erasing disambiguation pages, removing some XML tags an... more

  • Genomics IULA Corpus In Catalan

    The corpus consists of specialized texts from the genomics domain. This LSP corpus was created from articles in specialized publications, PhD theses, etc. It contains about 950,000 words in 134 documents.


Semantic Annotation: Semantic Roles

  • GrAF Version Of The SenSem Spanish Corpus

    This is the stand-off GrAF version of the SenSem Spanish Corpus. The original SenSem Spanish Corpus includes syntactic and semantic annotations for a number of Spanish texts from the press domain developed by the GRIAL group (Grup de recerca consolidat de la Generalitat de Catalunya). T... more

  • AnCora-ES

    The AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more

  • AnCora-CA

    The AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more


Semantic Annotation: Word Senses

  • AnCora-ES

    The AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more

  • AnCora-CA

    The AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more


Syntactic Annotation: Shallow Parsing

  • AnCora-ES

    The AnCora-ES is a Spanish corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more

  • GrAF Version Of The SenSem Spanish Corpus

    This is the stand-off GrAF version of the SenSem Spanish Corpus. The original SenSem Spanish Corpus includes syntactic and semantic annotations for a number of Spanish texts from the press domain developed by the GRIAL group (Grup de recerca consolidat de la Generalitat de Catalunya). T... more

  • AnCora-CA

    The AnCora-CA is a Catalan corpus annotated at different levels: Lemma and Part of Speech; Syntactic constituents and functions; Argument structure and thematic roles; Semantic classes of the verb; Denotative type of deverbal nouns; Nouns related to WordNet synsets; Named Entities and C... more


Syntactic Annotation: Treebanks

  • AnCora-DEP-Es

    AnCora-DEP-Es is the AnCora-Es multilevel annotated corpus of Spanish in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-Es can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al... more

  • AnCora-DEP-CA

    AnCora-DEP-CA is the AnCora-CA multilevel annotated corpus of Catalan in dependency-based representation, consisting of approximately 500,000 words. AnCora-DEP-CA can be used as a source of information for inducing grammars, developing, improving and/or evaluating syntactic parsers and al... more

  • IULA Spanish LSP Treebank

    This treebank consists of a number of syntactically analyzed sentences. The sentences were chosen from the IULA LSP corpus, automatically annotated with POS information, and manually annotated with syntactic information using the DELPH-IN environment. The resulting syntactic anal... more

  • IULA Penn Treebank

    This treebank consists of a number of Spanish and English sentences that have been manually annotated with syntactic information. The sentences were chosen from the Penn TreeBank corpus, a resource containing texts from the Wall Street Journal and originally compiled by the Universit... more

  • GrAF Version Of The Basque Dependency Treebank

    This is the stand-off GrAF version of the Basque Dependency Treebank (BDT). It is the Reference Corpus for the Processing of Basque (EPEC) annotated at syntactic level. EPEC is a 300,000 word corpus of standard written journal texts which aims to be a training corpus for the development... more

  • Tibidabo Treebank And IULA Spanish LSP Treebank Train And Test Partitions

    This package contains a partition of the IULA Spanish LSP Treebank into train and test sets for machine learning experiments. That way, different researchers can use the same partitions and compare their results directly. In this package we also deliver the Tibid... more


Alignment

  • CLUVI Parallel Corpus

    The CLUVI Corpus of the University of Vigo is an open collection of parallel text corpora developed under the direction of Xavier Gómez Guinovart (2003-2012) that covers specific areas of the contemporary Galician language. With 23 million words, the CLUVI Corpus comprises six main para... more
