
ABSTRACT

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets. PVLDB Reference Format: A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré. Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB, 11 (3): xxxx-yyyy, 2017. DOI: 10.14778/3157794.3157797
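The labeling-function idea the abstract describes can be illustrated with a minimal sketch. The toy task and heuristic names below are hypothetical, and the simple majority vote stands in for Snorkel's actual generative label model, which learns each labeling function's accuracy without ground truth:

```python
# Toy weak-supervision sketch: noisy labeling functions vote on
# unlabeled examples. Convention from data programming:
# +1 / -1 are class votes, 0 means the function abstains.

def lf_contains_causes(text):   # hypothetical heuristic
    return 1 if "causes" in text else 0

def lf_contains_not(text):      # hypothetical heuristic
    return -1 if "not" in text else 0

def lf_mentions_drug(text):     # hypothetical heuristic
    return 1 if "drug" in text else 0

LFS = [lf_contains_causes, lf_contains_not, lf_mentions_drug]

def weak_label(text):
    """Majority vote over non-abstaining labeling functions.
    Ties and all-abstain cases yield no training label (None)."""
    total = sum(lf(text) for lf in LFS)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return None

labels = [weak_label(t) for t in [
    "the drug causes drowsiness",
    "the drug does not cause harm",
    "unrelated sentence",
]]
```

The point of the real system is that the denoised labels from many such conflicting, correlated heuristics are then used to train a discriminative end model.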


https://arxiv.org/abs/1803.07828v1
https://github.com/AKSW/KG2Vec

Tommaso Soru¹, Stefano Ruberto², Diego Moussallem¹, Edgard Marx¹, Diego Esteves³, and Axel-Cyrille Ngonga Ngomo⁴

¹ AKSW, University of Leipzig, D-04109 Leipzig, Germany
{tsoru,moussallem,marx}@informatik.uni-leipzig.de
² Gran Sasso Science Institute, INFN, I-67100 L’Aquila, Italy
stefano.ruberto@gssi.infn.it
³ SDA, University of Bonn, D-53113 Bonn, Germany
esteves@cs.uni-bonn.de
⁴ Data Science Group, Paderborn University, D-33098 Paderborn, Germany
axel.ngonga@upb.de
Abstract. Knowledge Graph Embedding methods aim at representing entities and relations in a knowledge base as points or vectors in a continuous vector space. Several approaches using embeddings have shown promising results on tasks such as link prediction, entity recommendation, question answering, and triplet classification. However, only a few methods can compute low-dimensional embeddings of very large knowledge bases. In this paper, we propose KG2VEC, a novel approach to Knowledge Graph Embedding based on the skip-gram model. Instead of using a predefined scoring function, we learn it relying on Long Short-Term Memories. We evaluated the goodness of our embeddings on knowledge graph completion and show that KG2VEC is comparable to the quality of the scalable state-of-the-art approach RDF2Vec and can process large graphs by parsing more than a hundred million triples in less than 6 hours on common hardware.
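The skip-gram formulation the abstract alludes to can be sketched by reading each triple as a short three-token "sentence" and emitting (target, context) pairs for a word2vec-style trainer to consume. The toy triples and the all-pairs windowing below are illustrative assumptions, not the paper's exact preprocessing:

```python
# Sketch: turn KG triples into skip-gram training pairs.
# Each (subject, predicate, object) triple is a 3-token sentence;
# with a context window of >= 2, every token predicts every other
# token in the same triple.

TRIPLES = [                     # toy graph, not from the paper
    ("Leipzig", "locatedIn", "Germany"),
    ("AKSW", "basedIn", "Leipzig"),
]

def skipgram_pairs(triples):
    """Yield ordered (target, context) pairs per triple-sentence."""
    pairs = []
    for sentence in triples:
        for i, target in enumerate(sentence):
            for j, context in enumerate(sentence):
                if i != j:
                    pairs.append((target, context))
    return pairs

pairs = skipgram_pairs(TRIPLES)
# Each 3-token sentence yields 3 * 2 = 6 ordered pairs.
```

Entities that co-occur in many triples thus end up with similar contexts, which is what pushes their embeddings together.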


Yankai Lin¹, Shiqi Shen¹, Zhiyuan Liu¹,²*, Huanbo Luan¹, Maosong Sun¹,²
¹ Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
² Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China

Abstract

Distant supervised relation extraction has been widely used to find novel relational facts from text. However, distant supervision is inevitably accompanied by the wrong labelling problem, and these noisy data will substantially hurt the performance of relation extraction. To alleviate this issue, we propose a sentence-level attention-based model for relation extraction. In this model, we employ convolutional neural networks to embed the semantics of sentences. Afterwards, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Experimental results on real-world datasets show that our model can make full use of all informative sentences and effectively reduce the influence of wrongly labelled instances. Our model achieves significant and consistent improvements on relation extraction as compared with baselines. The source code of this paper can be obtained from https://github.com/thunlp/NRE.
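The sentence-level attention the abstract describes can be sketched in plain Python: each sentence in a bag of instances gets a score, a softmax turns the scores into weights, and the bag representation is the weighted sum of sentence vectors. The dot-product scoring against a fixed query vector below is a toy stand-in; in the paper, sentences are first encoded by a CNN and scored against a learned relation representation:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bag_representation(sentence_vecs, query):
    """Attention-weighted sum of sentence vectors in one bag.
    Sentences that score low against the query (likely noisy
    instances) receive small weights."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in sentence_vecs]
    weights = softmax(scores)
    dim = len(sentence_vecs[0])
    rep = [sum(w * v[d] for w, v in zip(weights, sentence_vecs))
           for d in range(dim)]
    return rep, weights

# Toy bag: two sentences expressing the relation, one noisy instance.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]   # hypothetical relation query vector
rep, weights = bag_representation(vecs, query)
```

The noisy third sentence gets the smallest weight, so it contributes least to the bag representation used for classification.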

1 Introduction

In recent years, various large-scale knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007) have been built and widely used in many natural language processing (NLP) tasks, including web search and question answering. These KBs are mostly composed of relational facts in triple format, e.g., (Microsoft, founder, Bill Gates). Although existing KBs contain a massive amount of facts, they are still far from complete compared to the infinite real-world facts. To enrich KBs, many efforts have been invested in automatically finding unknown relational facts. Therefore, relation extraction (RE), the process of generating relational data from plain text, is a crucial task in NLP.

* Corresponding author: Zhiyuan Liu (liuzy@tsinghua.edu.cn).

 

Most existing


Learning Local and Global Contexts Using a Convolutional Recurrent Network Model for Relation Classification in Biomedical Text

Desh Raj, Sunil Kumar Sahu and Ashish Anand
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, India

Abstract

The task of relation classification in the biomedical domain is complex due to the presence of samples obtained from heterogeneous sources such as research articles, discharge summaries, or electronic health records. It is also a constraint for classifiers which use manual feature engineering. In this paper, we propose a convolutional recurrent neural network (CRNN) architecture that combines RNNs and CNNs in sequence to solve this problem. The rationale behind our approach is that CNNs can effectively identify coarse-grained local features in a sentence, while RNNs are more suited for long-term dependencies. We compare our CRNN model with several baselines on two biomedical datasets, namely the i2b2-2010 clinical relation extraction challenge dataset and the SemEval-2013 DDI extraction dataset. We also evaluate an attentive pooling technique and report its performance in comparison with the conventional max pooling method. Our results indicate that the proposed model achieves state-of-the-art performance on both datasets.
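The pooling comparison mentioned in the abstract can be sketched as follows: max pooling keeps only the single largest activation per feature channel across time steps, while attentive pooling takes a softmax-weighted average over time steps. The per-step scores below are hand-picked for illustration; in the paper they come from a trained attention layer:

```python
import math

def max_pool(features):
    """features: list of time steps, each a list of channel values.
    Returns the per-channel maximum across all time steps."""
    channels = len(features[0])
    return [max(step[c] for step in features) for c in range(channels)]

def attentive_pool(features, scores):
    """Softmax-weighted average over time steps (one score per step),
    so every step contributes in proportion to its attention weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    channels = len(features[0])
    return [sum(w * step[c] for w, step in zip(weights, features))
            for c in range(channels)]

steps = [[0.2, 1.0], [0.9, 0.1], [0.4, 0.5]]   # toy CNN feature maps
scores = [0.1, 2.0, 0.3]                        # hypothetical attention scores

mp = max_pool(steps)              # -> [0.9, 1.0]
ap = attentive_pool(steps, scores)
```

Unlike max pooling, the attentive variant preserves graded information from all time steps instead of discarding everything but the single strongest activation.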

1 Introduction

Relation classification is the task of identifying the semantic relation present between a given pair of entities in a piece of text. Since most search queries are some form of binary expression (Agichtein et al., 2005), modern question answering systems rely heavily on relation classification as a preprocessing step (Fleischman et al., 2003; Lee et al., 2007). Accurate relation classification also facilitates discourse processing and precise sentence interpretation. Consequently, this task has received considerable attention over the last decade (Mintz et al., 2009; Surdeanu et al., 2012).

In the biomedical domain in particular, extracting such tuples from data is essential for identifying protein and drug interactions, symptoms and causes of diseases, and so on. Furthermore, since clinical data tends to be obtained from multiple heterogeneous sources of information, such as journal articles, discharge summaries, and electronic patient records, relation classification becomes an even more challenging task.

A wide variety of lexical, syntactic, or pragmatic cues may be leveraged to identify relations between entities, which makes the choice of feature types for classification challenging. Owing to this variation, numerous methods have been proposed, some of which rely on features extracted from POS tagging, morphological analysis, dependency parsing, and world knowledge (Kambhatla, 2004; Santos et al., 2015; Suchanek et al., 2006; Mooney and Bunescu, 2005; Bunescu and Mooney, 2005). Deep learning architectures have recently attracted much interest because of their ability to readily extract relevant features without explicit feature engineering. Consequently, many convolutional and recurrent neural network models (Zeng et al., 2014; Xu et al.,

In this paper, we propose a model that uses recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to learn global and local contexts, respectively. We refer to it as CRNN, following the naming convention used in (Huynh et al., 2016). We argue that for any classification task to be effective,

1 The code can be found at https://github.com/desh2608/crnn-relation-classification.


Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 311-321, Vancouver, Canada, August 3 - August 4, 2017. © 2017 Association for Computational Linguistics
