In the following we introduce and briefly explain the current research lines of the NLP & Semantic Computing Group.

Flexible knowledge representation models

In this research line we investigate how distributional semantics and logical/structured data models complement each other, focusing on the analysis of approximate inference over a distributional vector space. Logical models provide an expressive conceptual representation structure, with support for inference and expressive query capabilities. Distributional semantics adds a complementary layer in which the semantic approximation supported by large-scale, comprehensive semantic models, together with the scalability of the vector space model, can address the trade-off between expressivity and semantic/terminological flexibility.

The contributions of this research line concentrate on advancing the conceptual and formal work on the interaction between distributional semantics and logic, focusing on the investigation of a distributional deductive inference model for large-scale and heterogeneous knowledge bases. The proposed inference model targets the following features:

  1. an approximate reasoning approach for logical knowledge bases,
  2. the inclusion of large volumes of distributional commonsense knowledge into the inference process and
  3. a principled geometric representation of the inference process.
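As a minimal sketch of what such a geometric inference step could look like, the toy example below matches a query predicate against KB predicates by cosine similarity in vector space. The vectors, vocabulary and threshold are hand-picked illustrative assumptions, not the group's actual model:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical distributional vectors for KB predicates (toy 3-d space).
kb_predicates = {
    "author_of": [0.9, 0.1, 0.0],
    "composed":  [0.8, 0.2, 0.1],
    "born_in":   [0.0, 0.1, 0.9],
}

def approximate_match(query_vec, threshold=0.7):
    """Return KB predicates whose vectors are close enough to the query
    predicate, ranked by similarity -- a geometric stand-in for exact
    symbolic unification."""
    scored = [(p, cosine(query_vec, v)) for p, v in kb_predicates.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: -s[1])

# A query predicate "wrote" that is absent from the KB vocabulary still
# resolves to semantically close predicates instead of failing outright.
wrote = [0.85, 0.15, 0.05]
print(approximate_match(wrote))
```

Instead of an exact symbol lookup, the inference step returns a ranked, thresholded set of candidates, which is what makes the process both approximate and geometrically interpretable.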

Reasoning supported by distributional semantics

Building intelligent applications and addressing even simple computational semantic tasks demands coping with large-scale commonsense knowledge bases (KBs). Querying and reasoning (QnR) over large commonsense KBs are fundamental operations for tasks such as Question Answering, Semantic Search and Knowledge Discovery. However, in an open-domain scenario, the scale of the KBs and the number of direct and indirect associations between their elements can make QnR unmanageable. To the complexity of querying and reasoning over such large-scale KBs, one can add the barriers involved in building KBs that meet the necessary consistency and completeness requirements.

Since the information completeness of a KB cannot be guaranteed, a single missing fact can be sufficient to block the reasoning process. Ideally, QnR mechanisms should be able to cope with some level of KB incompleteness, approximating and filling the gaps in the KB. This research line investigates a selective reasoning approach that uses a hybrid distributional-relational semantic model to address these problems. In this work, distributional semantic models are used as a complementary semantic layer to the relational model, supporting semantic approximation and coping with incompleteness.
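A minimal sketch of this kind of gap-filling is given below: when a queried triple is missing from the KB, a distributionally similar relation between the same entities can still support the answer. The toy triples, relation vectors and threshold are hypothetical assumptions for illustration only:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy KB of triples and hypothetical relation vectors.
kb = {("mozart", "composed", "requiem"),
      ("mozart", "born_in", "salzburg")}
relation_vecs = {
    "composed": [0.8, 0.2],
    "wrote":    [0.75, 0.25],
    "born_in":  [0.1, 0.9],
}

def holds(subj, rel, obj, threshold=0.9):
    """Exact lookup first; if the triple is absent, accept it when a KB
    triple between the same entities has a sufficiently similar relation
    (semantic approximation over an incomplete KB)."""
    if (subj, rel, obj) in kb:
        return True, 1.0
    best = max((cosine(relation_vecs[rel], relation_vecs[r])
                for s, r, o in kb if s == subj and o == obj),
               default=0.0)
    return best >= threshold, best

# "wrote" is not asserted, but "composed" between the same entities is.
print(holds("mozart", "wrote", "requiem"))
```

The returned score can also be propagated through the proof, so that approximated steps are distinguishable from exact ones.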

Schema-agnostic queries & distributional-relational models (DRMs)

The evolution of data environments towards growth in the size, complexity, dynamicity and decentralisation (SCoDD) of schemas drastically impacts contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications demand more complete data, produced by independent data sources under different semantic assumptions and contexts of use.

The emergence of this new data environment demands revisiting the semantic assumptions behind databases and designing mechanisms that can support semantically heterogeneous databases. This research line aims at filling this gap by proposing a complementary semantic model for databases, based on distributional semantics. Distributional semantics provides a complementary perspective to the formal perspective of database semantics, supporting semantic approximation as a first-class database operation. Unlike models for uncertain and incomplete data or probabilistic databases, distributional-relational models focus on the construction of semantic approximation approaches for databases, supported by a semantic model built automatically from large-scale unstructured data external to the database, which serves as a semantic/commonsense knowledge base. This semantic abstraction can shield the database user from the representation of the data, supporting a schema-agnostic approach to data consumption.
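The core of the schema-agnostic idea can be sketched as follows: query terms are mapped to schema elements by distributional similarity rather than by exact name matching, so the user need not know the schema vocabulary. The schema, query terms and their vectors below are hypothetical toy assumptions:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical distributional vectors for schema attributes and for the
# terms a user happens to employ in a query (toy 2-d space).
schema_vecs = {"wage": [0.9, 0.1], "dept": [0.1, 0.9]}
query_vecs  = {"salary": [0.85, 0.2], "department": [0.2, 0.95]}

def map_term(term):
    """Map a user query term to the distributionally closest schema
    attribute, abstracting the user from the schema vocabulary."""
    return max(schema_vecs,
               key=lambda attr: cosine(query_vecs[term], schema_vecs[attr]))

# A user asking for "salary" is routed to the "wage" column even though
# the exact attribute name never appears in the query.
print(map_term("salary"))
```

In a full distributional-relational model the vectors would come from a large external corpus, and the mapping would be scored and thresholded rather than always returning the single best attribute.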

Distributional semantics systems and resources

Distributional semantic models (DSMs) are semantic models based on the statistical analysis of word co-occurrences in large corpora. DSMs can be used in a wide spectrum of semantic applications, including semantic search, question answering, paraphrase detection and word sense disambiguation. The ability to automatically harvest meaning from unstructured, heterogeneous data, the simplicity of use and the ability to build comprehensive semantic models are major strengths of distributional approaches.
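A minimal sketch of this co-occurrence-based construction is shown below on a three-sentence toy corpus (real DSMs use much larger corpora and weighting schemes such as PPMI or dimensionality reduction, which are omitted here):

```python
from collections import Counter
from math import sqrt

# Toy corpus; a real DSM would be built from a large corpus such as Wikipedia.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, how often every other word appears within a
    symmetric context window -- the raw vectors of a simple DSM."""
    vecs = {}
    for sent in sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            context = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(context)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = cooccurrence_vectors(corpus)
# "cat" and "mouse" share contexts (e.g. "chased"), so they come out more
# similar than "cat" and "cheese".
print(cosine(vecs["cat"], vecs["mouse"]))
print(cosine(vecs["cat"], vecs["cheese"]))
```

Even at this scale, the distributional hypothesis is visible: words occurring in similar contexts receive similar vectors.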

The construction of distributional semantic models depends on the processing of large-scale data resources. The English version of Wikipedia in 2014, for example, contains 44 GB of article data. The hardware and software infrastructure required to process corpora at this scale creates high entry barriers for researchers and developers who want to start experimenting with distributional semantics.

In order to reduce these barriers, this research line focuses on the development of fundamental distributional research infrastructures, commoditizing access to distributional semantic resources. The infrastructure consists of software, data and service resources that can be easily reused and redeployed by third parties.