Projects

3LB: Building a syntactic-semantic-trees-based database The main objective of the 3LB project is to build three treebanks (syntactically annotated corpora) for Spanish, Catalan and Basque. In addition to the syntactic annotation, a semantic annotation will be carried out using the synsets of the wordnets (http://www.cogsci.princeton.edu/~wn/w3wn.html) built for each language, together with an annotation of anaphoric and elliptic elements and of co-reference. The corpus size for Spanish and Catalan will be 100,000 words, and 50,000 for Basque, owing to its greater notational complexity and the smaller coverage of its wordnet (35,000 entries, compared with 100,000 for Spanish and 65,000 for Catalan).
Apertium Since 2004, the Apertium project has been developing a free/open-source machine translation platform, initially aimed at related-language pairs but later expanded to deal with more divergent language pairs. The mission of the Apertium project is to collaboratively develop free/open-source machine translation for as many languages as possible, and in particular: to give everyone free, unlimited access to the best possible machine-translation technologies; to maintain a modular, documented, open platform for machine translation and other human language processing tasks; to favour the interchange and reuse of existing linguistic data; to make integration with other free/open-source technologies easier; and to radically guarantee the reproducibility of machine translation and natural language processing research.
Apertium Trautorom Apertium Trautorom is a machine translation project linked to the Apertium project. This project develops the Romanian-Spanish language pair.
CESS-ECE Syntactically and Semantically Annotated Corpora (Spanish, Catalan, Basque) A syntactically annotated databank of 500,000 words, annotated with constituents and functions, for Catalan, Spanish, and Basque.
CLARIN ERIC CLARIN is one of the Research Infrastructures that were selected for the European Research Infrastructures Roadmap by ESFRI, the European Strategy Forum on Research Infrastructures. It is a distributed data infrastructure, with sites all over Europe. Typical sites are universities, research institutions, libraries and public archives. They all have in common that they provide access to digital language data collections, to digital tools to work with them, and to expertise for researchers to work with them. The CLARIN Governance and Coordination body at the European level is CLARIN ERIC. An ERIC is a new type of international legal entity, established by the European Commission in 2009. Its members are governments or intergovernmental organisations.
CLARIN, Common Language Resources and Technologies (Spain) CLARIN aims to facilitate access to collections of linguistic data (texts, multimedia recordings, dictionaries, etc.) and to enable the online use of language-technology-based tools for analysing and exploiting these data, especially for research in the Humanities and Social Sciences. This project was funded by the European Union (FP7-INFRASTRUCTURES-2007-1-212230) and by the Ministry of Education and Science (CAC-2007-23). Duration of the project: 2007-2008.
Chartex This project focuses on extracting information about places, people and the events in their lives from medieval charters, using NLP technologies such as named entity recognition (NER), among others.
Connecting Historical Authorities with Linked Data, Contexts and Entities This project aimed to provide a historic place-name gazetteer covering a thousand years of history, with links to attestations in old texts and maps. NER techniques were used to extract new data from the digitized English Place-Name Survey.
Digging Into the Enlightenment: Mapping the Republic of Letters This collaborative project analyzed the degree to which the effects of the Enlightenment can be observed in the writing of people of various occupations in a corpus of 53,000 18th-century letters called Electronic Enlightenment (EE). The hypothesis of the project presents a new perspective in the practice of interpretative research in the field of humanities. This view aimed to integrate innovative visualization and annotation techniques into interactive tools for exploring and analyzing information about people, places, times, and relationships in the 'Republic of Letters'.
Digging by Debating: linking massive datasets to specific arguments The project's goal is to develop ways of searching and visualizing the interactions between the humanities and the sciences, based on models of argumentation, by building tools. The research behind this project seeks to establish what volume of text collections (such as the HathiTrust/Google Books collection) can support the kinds of detailed argumentative analysis that are central to scholarly research and public discussion of science and the humanities. It proposes: linking contact points between philosophy and the sciences; topic modeling to identify rich content in a chosen topic; identifying and mapping key arguments with a novel analysis framework for propositions and arguments; and sentence modeling to get back to the HathiTrust materials.
Dynamic Variorum Editions The project's goal was to identify and track topics about the Greco-Roman world as they appear in multilingual public document collections (Internet Archive, JSTOR, HathiTrust, etc.). This project aimed to create an environment and to generate 'dynamic variorum' editions of texts based on a services infrastructure. The project's work included mining primary and secondary digital sources to locate where people and places from the Greco-Roman world are discussed, what Greek and Latin works are cited, and what has been said about the people, places, and texts of the Greco-Roman world over time.
Eaqua The project “aims at generating specific knowledge from ancient texts and will provide this knowledge via an open web-portal to the scientific community for future empirical studies. For this purpose researchers from the fields of Computer Science and Ancient Science will cooperate to adapt the available text mining technologies to the needs and requirements of the Ancient Studies”.
Etraces The project aims to detect “temporal traces and interconnecting relations of text passages in German language novels from 1500 and 1990, as well as social science texts created since 1909”.
EurOpentrad EurOpenTrad was a machine translation project. Its aim was to develop a machine translation system between English and the official languages of the Spanish State by extending the hybrid systems developed in OpenTrad for that purpose.
FAUST - Feedback Analysis for User adaptive Statistical Translation The objective of this project was to develop interactive machine translation systems which adapt rapidly and intelligently in response to user feedback. The research was based on translation in five bidirectional language pairs in these EU official languages: Czech-English; French-English; Romanian-English; Spanish-English; Spanish-Catalan.
IULA-UPF CLARIN Centre of Competence The project's aim is to create the IULA-UPF CLARIN Centre of Competence to promote and advise on the use of language technologies, analysis and exploitation tools, and access to digitized data, in particular for researchers in the Humanities and Social Sciences. This project is co-funded by the "Fons europeu de desenvolupament regional (FEDER), Programa operatiu FEDER de Catalunya 2007-2013, Objective 1" and Universitat Pompeu Fabra.
Integrated Social History Environment for Research This project is researching the application of tools to detect, link and visualize events, trends, people, organizations, and other entities of interest to social history. Its main focus is the text-mining-based extraction of rich semantic metadata for indexing, clustering and classifying collections, with the aim of reducing the manual costs currently involved in such activities.
JAZZ The project addresses the current heterogeneity of language data intended for linguistic research. The result of the project will be a unified system for storing and using language resources, together with robust tools enabling effective text processing. All the available language resources will be converted into the new system. The project is also concerned with the detection and classification of "named entities" in Czech texts, a problem not yet resolved for the Czech language. Including named entities in the unified data system will improve the results of automatic language processing, especially in the field of information retrieval from large text databases.
KNOW2: Language understanding technologies for multilingual domain-oriented information access The aim of KNOW2 was to obtain better performance of multilingual information access by using two main strategies: (i) moving from general to specific domains and (ii) incorporating text-mining and collaborative interfaces.
LIDER Project LIDER is an FP7 project. The project’s mission is to provide the basis for the creation of a Linguistic Linked Data cloud that can support content analytics tasks on unstructured multilingual cross-media content. By achieving this goal, LIDER will have an impact on the ease and efficiency with which Linguistic Linked Data will be exploited in content analytics processes.
Lang2World: Discovering the world knowledge codified in the language The objectives of this project were: a) progress and improvement of knowledge-based NLP tools and resources for syntax and semantics with applications to HLT tasks; b) development of linguistic knowledge-based methods applied to the identification of relevant entities in the Text Mining framework, namely extended Named Entity recognition and classification and terminology extraction from restricted domains; c) identification and formalization of the linguistic knowledge necessary for pattern extraction and paraphrase machine learning tasks and for the detection of given and new information; and d) development of linguistically based corpora, annotated at several levels of analysis, for use in supervised machine learning systems and for the tasks in the project.
META-NET META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Language Technologies will: enable communication and cooperation across languages, secure users of any language equal access to information and knowledge, and build upon and advance functionalities of networked information technology. A concerted, substantial, continent-wide effort in language technology research and engineering is needed for realising applications that enable automatic translation, multilingual information and knowledge management and content production across all European languages. This effort will also enhance the development of intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots.
META-NORD The META-NORD project aims to establish an open linguistic infrastructure in the Baltic and Nordic countries to serve the needs of the industry and research communities. The project will focus on 8 European languages - Danish, Estonian, Finnish, Icelandic, Latvian, Lithuanian, Norwegian and Swedish - that each have fewer than 10 million speakers. The project will assemble, link across languages, and make widely available language resources of different types used by different categories of user communities in academia and industry to create products and applications that facilitate linguistic diversity in the EU.
METANET4U – Enhancing the European Linguistic Infrastructure The central objective of the METANET4U project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them.
Mapping Texts The project aimed to develop new ways of discovering and analyzing language patterns embedded in historical newspaper databases. Its approach was to combine text mining and geospatial visualization methods to explore massive collections of electronic texts.
Mapping the Republic of Letters The project's goal was to visualize certain correspondence networks from digital scholarly editions of historical letters as a way of exploring a bundle of historical questions about the geographic range, diversity, and interactions among intellectuals during the seventeenth, eighteenth, and nineteenth centuries.
OPENMT-2: Hybrid Machine Translation and advanced evaluation The objective of this project was to develop a system through which Basque Wikipedia contributors collaborate with the University of the Basque Country (UPV/EHU) in order to generate new Wikipedia content and improve a previous machine translation system.
Opentrad Opentrad was an open-source machine translation project based on syntactic transfer, covering Spanish, Galician, Catalan/Valencian and Basque. Opentrad enabled the translation of texts or documents as well as the browsing of web pages with translation on the fly. The language pairs that already have an operating prototype are: Spanish <-> Catalan/Valencian, Spanish <-> Galician, and Spanish <-> Basque (the system only translates from Spanish to Basque, not vice-versa).
PANACEA PANACEA (Platform for Automatic, Normalised Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies). Its main goal is to develop technologies for the automation of all the stages involved in the acquisition, production, updating, validation and maintenance of Linguistic Technologies and Resources. The project, coordinated by our group, includes the participation of Cambridge University; the Istituto di Linguistica Computazionale, Italy; the Institute for Language and Speech Processing, Greece; Dublin City University, Ireland; and two companies, the German Linguatec and the French ELDA (Evaluation and Language Resources Distribution Agency). This project was funded by the Language Technologies Area, Information and Communication Technologies, of the EU 7th Framework Programme (7FP-ITC-248064). Duration of the project: 2010-2012.
PAROLE (LE2-4017) The aim of the PAROLE project was the compilation of large, generic and re-usable Written Language Resources for all European Languages, comprising more specifically: i) General language text corpora of the size of 20,000,000 words in 14 languages (Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish), and ii) computational lexicons with 20,000 lemmas in 12 languages (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish). The value of these resources lies not only in the size and number of languages covered by the project, but also in the fact that they are built according to common standards and specifications.
Praxem, semantic and pragmatic annotation of the CESS-ECE corpus The objective of this project was to annotate the CESS-ECE corpus with pragmatic and semantic information.
Project Arclight: Analytics for the Study of 20th Century Media A collaboration among interdisciplinary researchers at Concordia University and the University of Wisconsin-Madison, Project Arclight is developing a web-based tool that enables the study of 20th century American media through comparisons across time and space. The Arclight tool uses topic modeling, named entity recognition, and ranking algorithms to analyze roughly two million pages of public domain publications digitized by the Media History Digital Library and the Chronicling America Library of Congress National Digital Newspaper Program. The results are presented as data visualizations that users may click through to access the underlying sources.
Trading Consequences This project is a multi-institutional, international collaboration between environmental historians in Canada and computer scientists in the UK that uses text-mining software to explore thousands of pages of historical documents related to international commodity trading in the British Empire, involving Canada in particular, during the 19th century, and its impact on the economy and environment.
Viral Texts: Mapping Networks of Reprinting in 19th Century Newspapers and Magazines The project aims to improve search over a Nineteenth-Century American Newspapers corpus by using and developing data mining tools. It uses space-efficient n-gram indexing to identify candidate newspapers and then exploits local alignment models to identify reprinted fragments that are not known a priori.
XLike: Cross-lingual Knowledge Extraction The goal of the XLike project was to develop technology to monitor and aggregate knowledge spread across mainstream and social media, and to enable cross-lingual services for publishers, media monitoring and business intelligence.
Androcentrismo en la prensa española: ¿de quién hablan las noticias? This project investigates androcentric practices in the Spanish general press since the late 1980s. CLARIN contributed to the research by providing services and tools to run an experiment that automatically analysed a corpus of Spanish press since 2002. 150,000 newspaper headlines were analysed. The process included: a) automatically parsing over 150,000 news headlines, b) identifying the subjects of sentences using automatic annotation tools and c) semantic classification to calculate the presence of human subjects versus abstract subjects in Spanish newspaper headlines.
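The two-stage idea described for Viral Texts (index n-grams to find candidate texts, then align candidates to recover the shared passage) can be illustrated with a minimal sketch. This is not the project's code: the function names, the sample texts and the parameter values are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used as a simple stand-in for a local alignment model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_pairs(docs, n=3, min_shared=2):
    """Index n-grams by document id and return the pairs sharing enough of them."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def longest_shared_passage(text_a, text_b):
    """Return the longest run of words common to both texts (alignment stand-in)."""
    a, b = text_a.lower().split(), text_b.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


# Hypothetical sample texts, invented for illustration only.
docs = {
    "paper_a": "local news today the quick brown fox jumps over the lazy dog said the editor",
    "paper_b": "we reprint that the quick brown fox jumps over the lazy dog as seen elsewhere",
    "paper_c": "shipping prices rose sharply in the harbour this week",
}
pairs = candidate_pairs(docs, n=3, min_shared=2)
for a, b in sorted(pairs):
    print(a, b, "->", longest_shared_passage(docs[a], docs[b]))
```

The index keeps the expensive alignment step away from pairs of texts that share no n-grams, which is the point of the candidate stage; a real system would need a far more compact index and an alignment model that tolerates OCR noise.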
Sentence Semantics: Creación de una Base de Datos de Semántica Oracional This project aimed to build a databank of Spanish verbs based on a lexicon that links each verb sense to a significant number of manually analyzed corpus examples. This databank will reflect the syntactic and semantic behavior of Spanish verbs in naturally occurring text.
Municipals’11 online This is an e-communication and e-politics project that analyses the 2011 municipal election campaign in Spain. The main objective of the project is to analyse new tendencies in e-politics and to demonstrate the impact of the internet on the electoral process. CLARIN contributed to the research by providing services and tools to run an experiment that automatically analysed texts published in the Catalan political blogosphere in relation to this election campaign. The process included: a) automatically and periodically accessing 459 representative blogs in the Catalan political blogosphere, b) obtaining and pre-processing the different texts (over 8000 posts) and c) processing the texts with different statistical tools to calculate the most significant words for each blog in contrast to the others.
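The last step, finding the words most characteristic of one blog in contrast to the others, can be sketched as follows. The description does not say which statistical tools were used, so the measure here (a smoothed log-ratio of relative frequencies) is an assumption, and the function name and sample texts are hypothetical.

```python
import math
from collections import Counter


def distinctive_words(blogs, top_k=3):
    """Rank each blog's words by the smoothed log-ratio of their relative
    frequency in that blog versus in all the other blogs combined."""
    counts = {name: Counter(text.lower().split()) for name, text in blogs.items()}
    vocab = set().union(*counts.values())
    result = {}
    for name, own in counts.items():
        # Pool the word counts of every other blog.
        rest = Counter()
        for other, other_counts in counts.items():
            if other != name:
                rest.update(other_counts)
        # Add-one smoothing avoids division by zero for unseen words.
        own_total = sum(own.values()) + len(vocab)
        rest_total = sum(rest.values()) + len(vocab)
        scores = {
            word: math.log((own[word] + 1) / own_total)
            - math.log((rest[word] + 1) / rest_total)
            for word in own
        }
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        result[name] = ranked[:top_k]
    return result


# Hypothetical sample posts, invented for illustration only.
blogs = {
    "blog_a": "housing housing housing taxes election",
    "blog_b": "transport transport transport taxes election",
    "blog_c": "schools schools schools taxes election",
}
top = distinctive_words(blogs, top_k=1)
print(top)
```

Words that every blog uses at a similar rate score near zero, so only the vocabulary that sets a blog apart rises to the top of its list.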
Universitat Politècnica de Catalunya. Càtedra de Programari Lliure Due to the growing interest shown by society and businesses in free software, and in keeping with UPC’s unfaltering commitment to this technology, UPC’s Board of Management launched the Free Software Chair to serve as a vehicle for all the initiatives undertaken by the University in this area. The Chair, which was created in September 2004, aims to raise awareness of open-source software at UPC and in society at large. At UPC, the Chair aims to promote the use and development of open-source software, both amongst the teaching and research staff and administrative and service staff as well as amongst the student body. Beyond the university community, the Chair aims to channel the needs and interests of companies, government and society towards UPC’s team of experts, who will be able to offer them solutions crafted around free software.