GLUE

GLUE Benchmark Homepage: https://gluebenchmark.com/


-- Introduction from the official homepage --

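The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
A benchmark of nine sentence or sentence-pair language understanding tasks, built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.
A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language.
A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
The format of the GLUE benchmark is model-agnostic, so any system capable of processing sentences and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favor models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.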

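The GLUE (General Language Understanding Evaluation) benchmark is a collection of datasets built with the goal of "developing robust, general-purpose natural language understanding systems." GLUE therefore consists of datasets for training natural language processing models and for evaluating and comparing their performance. Made up of nine varied and difficult task datasets, GLUE was designed to evaluate the NLU (natural language understanding) ability of models, and it is now an essential benchmark for evaluating transfer learning models such as BERT.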

In the past, most natural language processing models were designed to solve one specific problem well. Models trained end-to-end to fit only that problem therefore did not perform well on other problems or other datasets.

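Because such a model was designed with a specific problem or dataset in mind, checking whether it had been trained well was easy. However, as many researchers began working on transfer learning and transfer learning proved successful, the need for a new way of evaluating models emerged.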

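Unlike a single-task model trained end-to-end to solve only one specific problem, a transfer learning model uses a deep model to learn with a focus on a generalized understanding of natural language. In other words, it acquires a general understanding ability through pre-training. That ability shows its value when the model is fine-tuned to perform some other specific task.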

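Remove the input and output layers used during pre-training from the existing model.
Replace them with input and output layers suited to the problem to be solved.
Train the resulting model for n epochs (fine-tuning).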
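
The three steps above can be sketched in plain Python. This is only a toy illustration under loud assumptions: `pretrained_body`, `TaskHead`, and `fine_tune` are hypothetical names, the "pre-trained" body is a hand-written feature function rather than a real deep network, only the output layer is swapped, and the body is kept frozen (in practice fine-tuning usually updates the body weights as well).

```python
import math

POSITIVE = {"good", "great", "fun"}
NEGATIVE = {"bad", "dull", "boring"}

def pretrained_body(sentence):
    """Toy stand-in for a pre-trained encoder: text -> feature vector."""
    words = sentence.lower().split()
    return [float(sum(w in POSITIVE for w in words)),
            float(sum(w in NEGATIVE for w in words))]

class TaskHead:
    """New output layer for a binary task (logistic regression)."""
    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict_proba(self, features):
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))

def fine_tune(model, examples, n_epochs, lr=0.5):
    """Train only the new head with plain SGD on the log loss."""
    head = model["head"]
    for _ in range(n_epochs):
        for sentence, label in examples:
            features = model["body"](sentence)
            error = head.predict_proba(features) - label
            head.weights = [w - lr * error * x
                            for w, x in zip(head.weights, features)]
            head.bias -= lr * error
    return model

# First two steps: drop the pre-training head and attach a task-specific one.
model = {"body": pretrained_body, "head": "pretraining head (to be discarded)"}
model["head"] = TaskHead(n_features=2)

# Third step: train for n epochs on labelled task data (1 = positive, 0 = negative).
train = [("a great fun movie", 1), ("dull and boring", 0),
         ("good acting", 1), ("bad plot", 0)]
fine_tune(model, train, n_epochs=20)

def predict(sentence):
    """Return 1 or 0 using the body plus the fine-tuned head."""
    return int(model["head"].predict_proba(model["body"](sentence)) > 0.5)
```

With a real model such as BERT the body would be a deep network and the head a small classification layer, but the shape of the procedure is the same.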

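How can we evaluate whether a new model performs better than earlier ones? To answer this question, a team led by researchers at New York University introduced the GLUE datasets, which let a single model be trained and evaluated on multiple tasks. Each of the nine tasks in GLUE is scored, and a final performance score is computed from those scores. As long as a model can handle all of the tasks in GLUE, it does not matter what architecture it has or what operations it performs internally.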
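
As a rough sketch of how per-task scores can be combined into one number: the GLUE paper describes the overall score as an unweighted average over tasks, where a task reporting several metrics is first reduced to the average of those metrics. The function name `glue_style_score` and all numbers below are made up for illustration, and only three of the nine tasks are shown.

```python
from statistics import mean

def glue_style_score(task_metrics):
    """Average the metrics within each task, then average across tasks."""
    per_task = {task: mean(list(metrics.values()))
                for task, metrics in task_metrics.items()}
    overall = mean(list(per_task.values()))
    return overall, per_task

# Made-up results, for illustration only.
results = {
    "CoLA": {"matthews_corr": 52.0},
    "MRPC": {"f1": 88.0, "accuracy": 84.0},
    "RTE": {"accuracy": 72.0},
}

overall, per_task = glue_style_score(results)
print(per_task["MRPC"])  # 86.0
print(overall)           # 70.0
```

Averaging within a task first keeps a task with two metrics from counting twice in the final score.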

GLUE Task Information

Dataset	Description	Umbrella Term(s)	Homepage / paper
CoLA	The Corpus of Linguistic Acceptability: Binary classification: single sentences that are either grammatical or ungrammatical.	Acceptability	https://nyu-mll.github.io/CoLA/
SST-2	Stanford Sentiment Treebank: Binary classification: phrases culled from movie reviews scored on their positivity/negativity. In the original treebank, phrases can be positive, negative, or completely neutral English phrases; GLUE uses the two-way positive/negative split.	Sentiment	https://nlp.stanford.edu/sentiment/index.html
MRPC	The Microsoft Research Paraphrase Corpus: given a pair of sentences, classify them as paraphrases or not paraphrases. From the GLUE paper: sentence pairs with human annotations of whether the sentences in the pair are semantically equivalent.	Paraphrase	https://www.microsoft.com/en-us/download/details.aspx?id=52398
STS-B	The Semantic Textual Similarity Benchmark (Cer et al., 2017) is based on the datasets for a series of annual challenges for the task of determining the similarity on a continuous scale from 1 to 5 of a pair of sentences drawn from various sources. We use the STS-Benchmark release, which draws from news headlines, video and image captions, and natural language inference data, scored by human annotators.	Sentence Similarity	https://www.aclweb.org/anthology/S17-2001
QQP	The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. Given two questions, the task is to determine whether they are semantically equivalent.	Paraphrase	https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
MNLI-m / MNLI-mm	The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The premise sentences are gathered from a diverse set of sources, including transcribed speech, popular fiction, and government reports. The test set is broken into two sections: matched (MNLI-m), which is drawn from the same sources as the training set, and mismatched (MNLI-mm), which uses different sources and thus requires domain transfer. We use the standard test set, for which we obtained labels privately from the authors, and evaluate on both sections. Both sections are scored by accuracy.	Natural Language Inference	https://www.nyu.edu/projects/bowman/multinli/
QNLI	The Stanford Question Answering Dataset (Rajpurkar et al. 2016; SQuAD) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). We automatically convert the original SQuAD dataset into a sentence pair classification task by forming a pair between a question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question. We filter out pairs where there is low lexical overlap between the question and the context sentence. Specifically, we select all pairs in which the most similar sentence to the question was not the answer sentence, as well as an equal amount of cases in which the correct sentence was the most similar to the question, but another distracting sentence was a close second.	Question Answering / Natural Language Inference	QNLI was created for GLUE; SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
RTE	The Recognizing Textual Entailment (RTE) datasets come from a series of annual challenges for the task of textual entailment, also known as NLI. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Each example in these datasets consists of a premise sentence and a hypothesis sentence, gathered from various online news sources. The task is to predict if the premise entails the hypothesis. We convert all the data to a two-class split (entailment or not entailment, where we collapse neutral and contradiction into not entailment for challenges with three classes) for consistency.	Natural Language Inference	https://aclweb.org/aclwiki/Textual_Entailment_Resource_Pool
WNLI	The Winograd Schema Challenge: the system must read a sentence with a pronoun and decide the referent from a list of choices. The examples are constructed to foil simple statistical methods. From the GLUE paper: the task (a slight relaxation of the original Winograd Schema Challenge) is to predict if the sentence with the pronoun substituted is entailed by the original sentence.	Coreference / Natural Language Inference	WNLI was introduced in the GLUE paper; Winograd Schema Challenge: https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
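
Every task in the table takes either a single sentence or a sentence pair and returns a prediction, which is what makes the benchmark model-agnostic. A minimal sketch of that contract might look like the following. `Example`, `Predictor`, `accuracy`, and `word_overlap_baseline` are hypothetical names, the three sentence pairs are invented, and the sketch covers only classification tasks (STS-B expects a real-valued similarity score instead of a class label).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Example:
    sentence1: str
    sentence2: Optional[str]  # None for single-sentence tasks such as CoLA or SST-2
    label: int

# Any callable with this signature can be evaluated, whatever is inside it.
Predictor = Callable[[str, Optional[str]], int]

def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples for which the prediction equals the label."""
    correct = sum(predict(ex.sentence1, ex.sentence2) == ex.label
                  for ex in examples)
    return correct / len(examples)

def word_overlap_baseline(sentence1: str, sentence2: Optional[str]) -> int:
    """Toy paraphrase detector: predict 1 when the Jaccard word overlap is >= 0.5."""
    if sentence2 is None:
        return 0
    a = set(sentence1.lower().split())
    b = set(sentence2.lower().split())
    return int(len(a & b) / len(a | b) >= 0.5)

# Invented question pairs in the style of QQP (1 = same meaning, 0 = different).
examples = [
    Example("How do I learn Python?", "How can I learn Python?", 1),
    Example("What is the capital of France?", "How tall is Mount Everest?", 0),
    Example("Is coffee healthy?", "Is tea healthy?", 0),
]

score = accuracy(word_overlap_baseline, examples)
```

The baseline gets the third pair wrong because the two questions share most of their words while asking about different things, which is the kind of shortcut the harder GLUE tasks are meant to expose.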
