HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

Lastra Díaz, Juan José; García Serrano, Ana; Batet, Montserrat; Fernández, Miriam; Chirigati, Fernando

Please use this identifier to cite or link to this item: http://hdl.handle.net/10609/93058

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lastra Díaz, Juan José	-
dc.contributor.author	García Serrano, Ana	-
dc.contributor.author	Batet, Montserrat	-
dc.contributor.author	Fernández, Miriam	-
dc.contributor.author	Chirigati, Fernando	-
dc.contributor.other	Universidad Nacional de Educación a Distancia	-
dc.contributor.other	Open University	-
dc.contributor.other	New York University	-
dc.date.accessioned	2019-04-11T07:54:00Z	-
dc.date.available	2019-04-11T07:54:00Z	-
dc.date.issued	2017-02-21	-
dc.identifier.citation	Lastra Díaz, J.J., García Serrano, A., Batet Sanromà, M., Fernández, M. & Chirigati, F. (2017). HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems, 66, 97-118. doi: 10.1016/j.is.2017.02.002	en
dc.identifier.issn	0306-4379	-
dc.identifier.uri	http://hdl.handle.net/10609/93058	-
dc.description.abstract	This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano in (2015, 2016) [56-58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML) based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks in the current semantic measures libraries, especially the performance and scalability, as well as the evaluation of new methods and the replication of most previous methods. The reproducible experiments introduced herein are encouraged by the lack of a set of large, self-contained and easily reproducible experiments with the aim of replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments. 
PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by providing a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability can be significantly improved without caching by using PosetHERep.	en
dc.language.iso	eng	-
dc.publisher	Information Systems	-
dc.relation.ispartof	Information Systems, 2017, 6	-
dc.relation.uri	https://doi.org/10.1016/j.is.2017.02.002	-
dc.rights	CC BY-NC-ND	-
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	-
dc.subject	intrinsic and corpus-based Information	en
dc.subject	HESML	ca
dc.subject	HESML	es
dc.subject	HESML	en
dc.subject	PosetHERep	ca
dc.subject	PosetHERep	es
dc.subject	PosetHERep	en
dc.subject	medidas semánticas bibliotecarias	es
dc.subject	mesures semàntiques bibliotecàries	ca
dc.subject	ontology-based semantic similarity	en
dc.subject	measures	en
dc.subject	medidas	es
dc.subject	mesures	ca
dc.subject	content models	en
dc.subject	modelos de contenido	es
dc.subject	models de contingut	ca
dc.subject	similarity	en
dc.subject	similitud	es
dc.subject	similitut	ca
dc.subject	ReproZip	ca
dc.subject	ReproZip	es
dc.subject	ReproZip	en
dc.subject	WNSimRep v1 dataset	ca
dc.subject	WNSimRep v1 dataset	es
dc.subject	WNSimRep v1 dataset	en
dc.subject	reproducible experiments on word	en
dc.subject	experimentos reproducibles con palabras	es
dc.subject	experiments reproduïbles amb paraules	ca
dc.subject	WordNet-based semantic similarity	en
dc.subject	WordNet-basado en similitud semántica	es
dc.subject	WordNet-basat en similitud semàntica	ca
dc.subject	información intrínseca basada en corpus	es
dc.subject	informació intrínseca basada en corpus	ca
dc.subject.lcsh	Ontologies (Information retrieval)	en
dc.title	HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset	-
dc.type	info:eu-repo/semantics/article	-
dc.subject.lemac	Ontologies (Informàtica)	ca
dc.subject.lcshes	Ontologías (Informática)	es
dc.identifier.doi	10.1016/j.is.2017.02.002	-
dc.gir.id	AR/0000005524	-
dc.relation.projectID	info:eu-repo/grantAgreement/TIN2015-71785-R	-
dc.relation.projectID	info:eu-repo/grantAgreement/S2015/HUM3494	-
dc.type.version	info:eu-repo/semantics/publishedVersion	-
Appears in Collections:	Articles científics Articles