Building a Task on the GLUE Datasets with BERT

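This post describes how to download and use the GLUE datasets, focusing on the MRPC dataset: downloading, converting, and storing the data. It also discusses the TensorFlow version problems encountered in the BERT project and how to resolve the resulting API incompatibilities, and it stresses the importance of choosing a tensorflow-gpu version that matches the installed CUDA version.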

0 Introduction

Google's open-source BERT project is hosted on GitHub. For a video walkthrough, see a video on Bilibili.

1 Introduction to Some of the GLUE Benchmark Datasets

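  • GLUE dataset official website
  • Downloading the GLUE datasets: the recommended way is to download and run the file download_glue_data.py. If the link cannot be opened, run the code below instead. Once run, the script automatically downloads the GLUE datasets into the local project folder. The datasets included are CoLA, diagnostic, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B, and WNLI. For a detailed introduction to these datasets in Chinese, see this blog post. This example builds its task on the MRPC dataset.
  • About Microsoft's MRPC dataset: this example is built on MRPC because the dataset is small, with just over 3,600 text examples. As the comment in the code below notes, however, MRPC is no longer hosted for copyright reasons and has to be downloaded manually. To do so, go to the official site and download the file MSRParaphraseCorpus.msi. Double-click it to install, which creates a folder containing the MRPC data.
    Once the datasets are in place, the file structure is as shown in the figure below.
    [Figure: project file structure after the datasets are in place]
    The script download_glue_data.py used to download the GLUE datasets is listed below. If you have trouble downloading the datasets, you can get them from Baidu Netdisk:
    Link: https://pan.baidu.com/s/1D_AJ_GgWgaPuYbror_jUNg
    Extraction code: 9k9r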
''' Script for downloading all GLUE data.

Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized, 
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).

mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
cat MRPC/_2DEC3DBE877E4DB192D17C0256E90F1D | tr -d $'\r' > MRPC/msr_paraphrase_train.txt
cat MRPC/_D7B391F9EAFF4B1B8BCE8F21B20B1B61 | tr -d $'\r' > MRPC/msr_paraphrase_test.txt
rm MRPC/_*
rm MSRParaphraseCorpus.msi

1/30/19: It looks like SentEval is no longer hosting their extracted and tokenized MRPC data, so you'll need to download the data from the original source for now.
2/11/19: It looks like SentEval actually *is* hosting the extracted data. Hooray!
'''

import os
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile

import urllib as URLLIB
import urllib.response
import urllib.parse
import io
# from six.moves import urllib


TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
TASK2PATH = {
    "CoLA": 'https://dl.fbaipublicfiles.com/glue/data/CoLA.zip',
    "SST": 'https://dl.fbaipublicfiles.com/glue/data/SST-2.zip',
    "QQP": 'https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip',
    "STS": 'https://dl.fbaipublicfiles.com/glue/data/STS-B.zip',
    "MNLI": 'https://dl.fbaipublicfiles.com/glue/data/MNLI.zip',
    "QNLI": 'https://dl.fbaipublicfiles.com/glue/data/QNLIv2.zip',
    "RTE": 'https://dl.fbaipublicfiles.com/glue/data/RTE.zip',
    "WNLI": 'https://dl.fbaipublicfiles.com/glue/data/WNLI.zip',
    "diagnostic": 'https://dl.fbaipublicfiles.com/glue/data/AX.tsv',
}

MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'

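def download_and_extract(task, data_dir):
    # NOTE: the source listing is cut off after "print(". The body below is a
    # minimal reconstruction of a download-and-unzip step, not the verbatim
    # original. It assumes TASK2PATH[task] points at a .zip archive.
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)  # remove the archive once its contents are extracted
    print("\tCompleted!")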
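The MRPC steps in the script's docstring leave you with two raw text files, msr_paraphrase_train.txt and msr_paraphrase_test.txt. As a rough sketch of the conversion step mentioned at the start of this post, the snippet below parses such a file and writes it back out as a clean TSV. It assumes the raw file is tab-separated with one header row and five fields per line (label, two IDs, two sentences). The function names parse_mrpc and write_tsv are hypothetical, and the sample sentences are made up for illustration rather than taken from MRPC.

```python
import io


def parse_mrpc(lines):
    """Parse raw MRPC lines into (label, id1, id2, sentence1, sentence2) tuples.

    Assumes the first line is a header and every data line has exactly five
    tab-separated fields; lines that do not match are skipped.
    """
    rows = []
    it = iter(lines)
    next(it, None)  # skip the header row
    for line in it:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 5:
            continue  # skip malformed or empty lines
        label, id1, id2, s1, s2 = fields
        rows.append((int(label), id1, id2, s1, s2))
    return rows


def write_tsv(rows, out_fh):
    """Write parsed rows as a tab-separated file with a header line."""
    out_fh.write("Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
    for label, id1, id2, s1, s2 in rows:
        out_fh.write("%d\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))


# Tiny in-memory sample with Windows line endings (made-up sentences).
sample = io.StringIO(
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\r\n"
    "1\t101\t102\tThe cat sat.\tA cat was sitting.\r\n"
    "0\t103\t104\tIt rained today.\tStocks fell sharply.\r\n"
)
rows = parse_mrpc(sample)
out = io.StringIO()
write_tsv(rows, out)
```

To use it on real data, pass an opened file object (for example one opened with encoding="utf-8") to parse_mrpc instead of the in-memory sample, and pass a writable file to write_tsv.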