LexNLP by LexPredict
Date (ISO 8601): 2022-04-15
This dataset contains 69,411 plaintext files, each corresponding to an ArXiv document abstract. Each abstract contains at least one appearance of the substring "agreement".
Each text file in this dataset contains the text of an abstract extracted from the full JSON Lines-formatted dataset (described below). Each file is named after its ArXiv ID and has been given the .txt file extension. Where the ArXiv ID contained a forward slash (/), the forward slash was replaced with an underscore (_). The text files have a median length of 1057 characters and a mean length of 1100 characters.
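The file naming rule can be sketched in a few lines of Python. This is an illustration only: the function name `arxiv_id_to_filename` is hypothetical and is not part of the dataset tooling.

```python
def arxiv_id_to_filename(arxiv_id: str) -> str:
    """Map an ArXiv ID to a file name following the rule described above.

    Forward slashes (found in old-style IDs such as "hep-th/9901001")
    are replaced with underscores, and the .txt extension is appended.
    """
    return arxiv_id.replace("/", "_") + ".txt"


if __name__ == "__main__":
    # Old-style ID containing a forward slash.
    print(arxiv_id_to_filename("hep-th/9901001"))
    # New-style ID with no slash; only the extension is added.
    print(arxiv_id_to_filename("2101.00001"))
```

Note that the mapping is not reversible in general, since an underscore in a file name may or may not have been a forward slash in the original ID.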
The full ArXiv metadata dataset can be found on Kaggle and includes additional information alongside each abstract, such as document authors, comments, DOI, etc. The original dataset was distributed under the CC0: Public Domain license, thereby permitting this modification and redistribution.
Date (ISO 8601): 2022-04-16
This is a partial redistribution of The Atticus Project's CUAD v1 dataset of 510 labeled contracts.
Unlike in the original dataset, the plaintext documents have been organized into their respective contract type categories.
The original dataset is licensed under CC BY 4.0.
Notes:
- ADUROBIOTECH,INC_06_02_2020-EX-10.7-CONSULTING AGREEMENT.txt is duplicated as ADUROBIOTECH,INC_06_02_2020-EX-10.7-CONSULTING AGREEMENT(1).txt in both this redistribution and the original dataset.
- HarpoonTherapeuticsInc_20200312_10-K_EX-10.18_12051356_EX-10.18_Development Agreement.txt has a corresponding PDF named HarpoonTherapeuticsInc_20200312_10-K_EX-10.18_12051356_EX-10.18_Development Agreement_Option Agreement.pdf
- NETGEAR,INC_04_21_2003-EX-10.16-AMENDMENT TO THE DISTRIBUTOR AGREEMENT BETWEEN INGRAM MICRO AND NETGEAR.txt has a corresponding PDF named NETGEAR,INC_04_21_2003-EX-10.16-AMENDMENT TO THE DISTRIBUTOR AGREEMENT BETWEEN INGRAM MICRO AND NETGEAR-.pdf
Date (ISO 8601): 2022-04-15
This dataset is a partial redistribution of the case_text_open data available from the Caselaw Access Project.
Specifically, this dataset contains a subset of the files from the original Caselaw Access Project dataset. Files were drawn at random from the original data until the subset reached a total of approximately 144 million characters, not counting newlines or spaces. This was done to approximately match the character length of a different dataset.
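A minimal sketch of this kind of budget-driven sampling is shown below. It is not the script that produced the dataset: the function names, the in-memory `documents` mapping, and the fixed seed are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def count_characters(text: str) -> int:
    """Count characters, excluding spaces and newlines."""
    return sum(1 for ch in text if ch not in (" ", "\n"))


def sample_until_budget(documents: Dict[str, str], budget: int, seed: int = 0) -> List[str]:
    """Draw documents in random order until the character budget is reached.

    `documents` maps a (hypothetical) file name to its text. The last
    document drawn may push the total somewhat past `budget`.
    """
    rng = random.Random(seed)
    names = sorted(documents)  # sort first so the shuffle is reproducible
    rng.shuffle(names)

    selected: List[str] = []
    total = 0
    for name in names:
        if total >= budget:
            break
        selected.append(name)
        total += count_characters(documents[name])
    return selected


if __name__ == "__main__":
    docs = {"a.txt": "ab cd\n", "b.txt": "xyz", "c.txt": "12 34"}
    print(sample_until_budget(docs, budget=5))
```

For the real data, the budget would be roughly 144,000,000 and the texts would be read from disk rather than held in a dictionary.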
Permission to redistribute is implicitly included on Caselaw Access Project's "About" page, under Usage & access:
Thus far, Illinois, Arkansas, New Mexico, and North Carolina have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction.
This data was downloaded from the Caselaw Access Project in April 2021.
This dataset contains 2387 text files from SEC EDGAR, each with "agreement" in its file name. The documents have been sorted into the following categories:
Date (ISO 8601): 2022-04-16
This dataset contains 10,000 EUR-Lex documents downloaded via http://api.epdb.eu/.
Important excerpts from EUR-Lex's copyright notice are quoted below:
The Commission’s document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can re-use the legal documents published in EUR-Lex for commercial or non-commercial purposes.
The copyright for the editorial content of this website, the summaries of EU legislation and the consolidated texts, which is owned by the EU, is licensed under the Creative Commons Attribution 4.0 International licence.
Date (ISO 8601): 2022-04-11
Extracted from: https://www.govinfo.gov/bulkdata/FR/2021
Converted to text using Apache Tika.
Date (ISO 8601): 2022-04-11
The USPTO backgrounds were downloaded using a derivative of this script: https://github.com/EleutherAI/pile-uspto
This sample contains 4500 text files distributed evenly across 45 directories. Each text file contains the text of a USPTO application background and has been placed in the directory that represents the grant's year of issue. The texts were randomly selected from the subset of backgrounds that are two thousand or more characters in length.
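The selection described above could look roughly like the following sketch. It assumes one directory per year of issue, so that an even split gives 4500 / 45 = 100 files per year. The function name, the `(year, text)` record shape, and the seed are hypothetical and are not taken from the original script.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_LENGTH = 2000        # keep backgrounds of two thousand or more characters
PER_YEAR = 4500 // 45    # 100 files per directory, assuming an even split


def sample_backgrounds(
    records: Iterable[Tuple[int, str]],
    per_year: int = PER_YEAR,
    min_length: int = MIN_LENGTH,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Group backgrounds by year of issue and sample up to `per_year` per year."""
    rng = random.Random(seed)
    by_year: Dict[int, List[str]] = defaultdict(list)
    for year, text in records:
        if len(text) >= min_length:  # length filter is applied before sampling
            by_year[year].append(text)
    return {
        year: rng.sample(texts, min(per_year, len(texts)))
        for year, texts in sorted(by_year.items())
    }
```

Each key of the returned dictionary would then correspond to one output directory.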
Date (ISO 8601): 2022-04-19
A sample of SEC EDGAR forms from OpenEDGAR stored in plaintext.
Form | Count |
---|---|
3 | 198 |
4 | 198 |
5 | 200 |
8-K | 197 |
10-K | 199 |
Name | Class | State |
---|---|---|
transformerpreprocessor | TransformerPreprocessor | head_character_n=0, normalizer=<lexnlp.ml.normalizers.Normalizer object> |
transformervectorizer | TransformerVectorizer | vectorizers=(<lexnlp.ml.vectorizers.VectorizerDoc2Vec object>, <lexnlp.ml.vectorizers.VectorizerKeywordSearch object>) |
minmaxscaler | MinMaxScaler | feature_range=(-1.0, 1.0) |
logisticregressioncv | LogisticRegressionCV |
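
The minmaxscaler step is listed with feature_range=(-1.0, 1.0). As a rough illustration of what that range means, the sketch below rescales a single feature column so that its minimum maps to -1 and its maximum maps to 1. It is plain Python written for this explanation, not the scikit-learn implementation, and the function name `min_max_scale` is hypothetical.

```python
from typing import List, Sequence, Tuple


def min_max_scale(
    values: Sequence[float],
    feature_range: Tuple[float, float] = (-1.0, 1.0),
) -> List[float]:
    """Linearly rescale one feature column into `feature_range`."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Choice made for this sketch: a constant column maps to the lower bound.
        return [low for _ in values]
    return [(v - v_min) / span * (high - low) + low for v in values]


if __name__ == "__main__":
    print(min_max_scale([0.0, 5.0, 10.0]))
```

In the actual pipeline the scaling parameters are learned from the training data and then reused at prediction time, so new inputs can fall outside the range.
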
Dataset | Description | Hyperlink |
---|---|---|
corpus/contract-types/0.1 | A sample of labeled contract types obtained from SEC EDGAR | https://www.sec.gov/edgar.shtml |
corpus/atticus-cuad-v1-plaintext/0.1 | Atticus CUAD v1 contracts | https://www.atticusprojectai.org/cuad |
LOGISTICREGRESSIONCV
precision recall f1-score support
ADVISORY AGREEMENT 0.64 0.69 0.67 13
AFFILIATE AGREEMENT 0.67 1.00 0.80 2
AGENCY AGREEMENT 0.82 0.64 0.72 14
ARBITRATION AGREEMENT 1.00 1.00 1.00 1
ASSIGNMENT AGREEMENT 0.25 0.40 0.31 5
ASSUMPTION AGREEMENT 0.33 0.30 0.32 10
COLLABORATION AGREEMENT 0.53 0.59 0.56 17
CONFIDENTIALITY AGREEMENT 0.67 0.91 0.77 11
CONTRIBUTION AGREEMENT 0.85 0.79 0.81 14
CO_BRANDING AGREEMENT 0.67 0.50 0.57 4
DEALER AGREEMENT 1.00 1.00 1.00 13
DEPOSIT AGREEMENT 0.71 1.00 0.83 10
DEVELOPMENT AGREEMENT 0.44 0.44 0.44 18
DISTRIBUTION AGREEMENT 0.67 0.67 0.67 18
EMPLOYMENT AGREEMENT 0.82 0.71 0.76 65
ENDORSEMENT AGREEMENT 0.80 0.80 0.80 5
ENTITY STRUCTURE AGREEMENT 0.00 0.00 0.00 3
ESCROW AGREEMENT 0.90 0.75 0.82 12
EXCHANGE AGREEMENT 1.00 0.85 0.92 13
FRANCHISE AGREEMENT 0.93 0.87 0.90 15
HOSTING AGREEMENT 0.50 0.75 0.60 4
INDEMNIFICATION AGREEMENT 0.91 0.91 0.91 11
INTERCREDITOR AGREEMENT 0.89 0.81 0.85 21
INVESTMENT AGREEMENT 0.50 0.50 0.50 6
IP AGREEMENT 1.00 0.67 0.80 3
JOINT FILING AGREEMENT 0.50 0.67 0.57 3
JOINT VENTURE AGREEMENT 0.00 0.00 0.00 2
LEASE AGREEMENT 0.67 0.67 0.67 3
LICENSE AGREEMENT 0.50 0.40 0.44 10
LOAN AGREEMENT 0.74 0.52 0.61 27
MAINTENANCE AGREEMENT 0.44 0.57 0.50 7
MANAGEMENT AGREEMENT 0.50 1.00 0.67 3
MANUFACTURING AGREEMENT 0.29 0.50 0.36 4
MARKETING AGREEMENT 0.33 0.33 0.33 3
MERGER & ACQUISTION AGREEMENT 0.67 0.56 0.61 18
NON-DISCLOSURE AGREEMENT 0.56 0.71 0.63 7
NOT A CONTRACT AGREEMENT 0.63 0.83 0.72 23
OTHER CONTRACT AGREEMENT 0.00 0.00 0.00 3
OUTSOURCING AGREEMENT 0.00 0.00 0.00 4
PROMOTION AGREEMENT 0.00 0.00 0.00 2
REGISTRATION RIGHTS AGREEMENT 0.40 1.00 0.57 2
RESELLER AGREEMENT 0.00 0.00 0.00 2
SALES CONTRACT AGREEMENT 0.56 0.50 0.53 10
SECURITIES SALES AGREEMENT 0.00 0.00 0.00 2
SECURITY AGREEMENT 0.50 0.40 0.44 5
SERVICES AGREEMENT 0.50 0.46 0.48 13
SERVICING AGREEMENT 0.67 0.67 0.67 3
SETTLEMENT AGREEMENT 0.57 0.67 0.62 12
SPONSORSHIP AGREEMENT 1.00 0.83 0.91 6
STOCK OPTION AGREEMENT 0.56 0.79 0.66 24
STRATEGIC ALLIANCE AGREEMENT 1.00 1.00 1.00 6
SUBORDINATION AGREEMENT 0.57 0.67 0.62 6
SUPPLY AGREEMENT 0.43 0.43 0.43 7
TAX ALLOCATION AGREEMENT 0.88 1.00 0.93 7
TRANSPORTATION AGREEMENT 1.00 0.67 0.80 3
TRUST AGREEMENT 0.00 0.00 0.00 3
UNDERWRITING AGREEMENT 1.00 0.88 0.93 8
WAIVER AGREEMENT 0.72 0.87 0.79 15
WARRANT AGREEMENT 0.80 0.86 0.83 14
accuracy 0.68 575
macro avg 0.58 0.61 0.59 575
weighted avg 0.68 0.68 0.67 575
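
For readers comparing the per-class rows with the summary rows, the sketch below shows how an F1 score relates to precision and recall, and how a macro average differs from a support-weighted average. The helper names are hypothetical, and the example uses only three rows copied from the report, so its averages will not match the report's overall figures.

```python
from typing import List, Tuple

Row = Tuple[str, float, int]  # (label, f1-score, support)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(rows: List[Row]) -> float:
    """Unweighted mean: every class counts equally, however rare."""
    return sum(score for _, score, _ in rows) / len(rows)


def weighted_average(rows: List[Row]) -> float:
    """Mean weighted by support: frequent classes dominate."""
    total = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total


if __name__ == "__main__":
    rows = [
        ("AFFILIATE AGREEMENT", 0.80, 2),
        ("ASSIGNMENT AGREEMENT", 0.31, 5),
        ("DEALER AGREEMENT", 1.00, 13),
    ]
    print(f"macro={macro_average(rows):.2f} weighted={weighted_average(rows):.2f}")
    # AGENCY AGREEMENT row: precision 0.82, recall 0.64.
    print(f"f1={f1(0.82, 0.64):.2f}")
```

This is why the macro average in the report (0.59) sits below the weighted average (0.67): several small classes score 0.00 and pull the unweighted mean down, while larger classes such as EMPLOYMENT AGREEMENT carry more weight in the weighted mean.
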
import cloudpickle
from sklearn.pipeline import Pipeline
from lexnlp.extract.en.contracts.predictors import ProbabilityPredictorContractType

# Load the serialized classification pipeline from disk.
with open('pipeline_contract_type_classifier.cloudpickle', 'rb') as f:
    pipeline_contract_type_classifier: Pipeline = cloudpickle.load(f)

# Wrap the pipeline in the contract type predictor.
probability_predictor_contract_type: ProbabilityPredictorContractType = \
    ProbabilityPredictorContractType(pipeline=pipeline_contract_type_classifier)

# Detect the contract type, using 0.5 as the minimum probability.
probability_predictor_contract_type.detect_contract_type(
    text=['This is a sentence.', 'LICENSE AGREEMENT', 'The owner shall be responsible for the license of this software.'],
    min_probability=0.5,
)