Commit 82caac8 by sarahyurick (verified) · 1 Parent(s): 134a5ac

Update README.md

Files changed (1)
  1. README.md +171 -3
README.md CHANGED
@@ -2,8 +2,176 @@
2
  tags:
3
  - model_hub_mixin
4
  - pytorch_model_hub_mixin
 
5
  ---
6
 
7
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
8
- - Library: [More Information Needed]
9
- - Docs: [More Information Needed]
2
  tags:
3
  - model_hub_mixin
4
  - pytorch_model_hub_mixin
5
+ license: other
6
  ---
7
 
8
+ # Content Type Classifier
9
+
10
+ # Model Overview
11
+
12
+ This is a text classification model that assigns each document to one of 11 speech types (content types) based on its content. The 11 classes are:
13
+
14
+ * Product/Company/Organization/Personal Websites: Informational pages about companies, products, or individuals.
15
+ * Explanatory Articles: Detailed, informative articles that aim to explain concepts or topics.
16
+ * News: News articles covering current events, updates, and factual reporting.
17
+ * Blogs: Personal or opinion-based entries typically found on blogging platforms.
18
+ * MISC: Miscellaneous content that doesn’t fit neatly into the other categories.
19
+ * Boilerplate Content: Standardized text used frequently across documents, often repetitive or generic.
20
+ * Analytical Exposition: Analytical or argumentative pieces with in-depth discussion and evaluation.
21
+ * Online Comments: Short, often informal comments typically found on social media or forums.
22
+ * Reviews: Content sharing opinions or assessments about products, services, or experiences.
23
+ * Books and Literature: Excerpts or full texts from books, literary works, or similar long-form writing.
24
+ * Conversational: Informal, dialogue-like text that mimics a conversational tone.
25
+
26
+ # License
27
+ This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
28
+
29
+ # References
30
+ * [DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing](https://arxiv.org/abs/2111.09543)
31
+ * [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://github.com/microsoft/DeBERTa)
32
+
33
+ # Model Architecture
34
+ * The model architecture is DeBERTa V3 Base
35
+ * Context length is 1024 tokens
36
+
37
+ # How to Use in NVIDIA NeMo Curator
38
+ NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.
39
+
40
+ The inference code for this model is available through the NeMo Curator GitHub repository. Check out this [example notebook](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) to get started.
41
+
42
+ # How to Use in Transformers
43
+ To run the content type classifier with Transformers, use the following code:
44
+
45
+ ```python
+ import torch
+ from torch import nn
+ from transformers import AutoModel, AutoTokenizer, AutoConfig
+ from huggingface_hub import PyTorchModelHubMixin
+
+ class CustomModel(nn.Module, PyTorchModelHubMixin):
+     def __init__(self, config):
+         super(CustomModel, self).__init__()
+         # Backbone encoder followed by dropout and a linear classification head
+         self.model = AutoModel.from_pretrained(config["base_model"])
+         self.dropout = nn.Dropout(config["fc_dropout"])
+         self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))
+
+     def forward(self, input_ids, attention_mask):
+         features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
+         dropped = self.dropout(features)
+         outputs = self.fc(dropped)
+         # Use the first token's logits and return per-class probabilities
+         return torch.softmax(outputs[:, 0, :], dim=1)
+
+ # Setup configuration and model
+ config = AutoConfig.from_pretrained("nvidia/content-type-classifier-deberta")
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/content-type-classifier-deberta")
+ model = CustomModel.from_pretrained("nvidia/content-type-classifier-deberta")
+ model.eval()
+
+ # Prepare and process inputs
+ text_samples = ["Hi, great video! I am now a subscriber."]
+ inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
+ with torch.no_grad():  # inference only, no gradients needed
+     outputs = model(inputs["input_ids"], inputs["attention_mask"])
+
+ # Predict and display results
+ predicted_classes = torch.argmax(outputs, dim=1)
+ predicted_domains = [config.id2label[class_idx.item()] for class_idx in predicted_classes.cpu().numpy()]
+ print(predicted_domains)
+ # ['Online Comments']
+ ```
81
+
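Because the `forward` method above returns softmax probabilities rather than a single label, it can be useful to look at the few most probable classes instead of only the argmax. The following is a minimal sketch of such a helper; `top_k_labels`, `demo_id2label`, and `demo_probs` are hypothetical names and values introduced here for illustration and are not part of the model card.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_labels(
    probs: Sequence[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], float(p)) for idx, p in ranked[:k]]


# Hypothetical three-class mapping and probabilities, for illustration only.
demo_id2label = {0: "News", 1: "Blogs", 2: "Reviews"}
demo_probs = [0.2, 0.7, 0.1]
demo_top2 = top_k_labels(demo_probs, demo_id2label, k=2)
print(demo_top2)
```

With the usage code above, one row of the model output could be passed as `top_k_labels(outputs[0].tolist(), config.id2label)`.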
82
+ # Input & Output
83
+ ## Input
84
+ * Input Type: Text
85
+ * Input Format: String
86
+ * Input Parameters: 1D
87
+ * Other Properties Related to Input: Token Limit of 1024 tokens
88
+
89
+ ## Output
90
+ * Output Type: Text Classification
91
+ * Output Format: String
92
+ * Output Parameters: 1D
93
+ * Other Properties Related to Output: None
94
+
95
+ The model takes one or several paragraphs of text as input. Example input:
96
+
97
+ ```
98
+ Brent awarded for leading collaborative efforts and leading SIA International Relations Committee.
99
+
100
+ Mar 20, 2018
101
+
102
+ The Security Industry Association (SIA) will recognize Richard Brent, CEO, Louroe Electronics with the prestigious 2017 SIA Chairman's Award for his work to support leading the SIA International Relations Committee and supporting key government relations initiatives.
103
+
104
+ With his service on the SIA Board of Directors and as Chair of the SIA International Relations Committee, Brent has forged relationships between SIA and agencies like the U.S. Commercial Service. A longtime advocate for government engagement generally and exports specifically, Brent's efforts resulted in the publication of the SIA Export Assistance Guide last year as a tool to assist SIA member companies exploring export opportunities or expanding their participation in trade.
105
+
106
+ SIA Chairman Denis Hébert will present the SIA Chairman's Award to Brent at The Advance, SIA's annual membership meeting, scheduled to occur on Tuesday, April 10, 2018, at ISC West.
107
+
108
+ "As the leader of an American manufacturing company, I have seen great business opportunities in foreign sales," said Brent. "Through SIA, I have been pleased to extend my knowledge and experience to other companies that can benefit from exporting. And that is the power of SIA: To bring together distinct companies to share expertise across vertical markets in a collaborative fashion. I'm pleased to contribute, and I thank the Chairman for his recognition."
109
+
110
+ "As a member of the SIA Board of Directors, Richard Brent is consistently engaged on a variety of issues of importance to the security industry, particularly related to export assistance programs that will help SIA members to grow their businesses," said Hébert. "His contributions in all areas of SIA programming have been formidable, but we owe him a particular debt in sharing his experiences in exporting. Thank you for your leadership, Richard."
111
+
112
+ Hébert will present SIA award recipients, including the SIA Chairman's Award, SIA Committee Chair of the Year Award and Sandy Jones Volunteer of the Year Award, at The Advance, held during ISC West in Rooms 505/506 of the Sands Expo in Las Vegas, Nevada, on Tuesday, April 10, 10:30-11:30 a.m. Find more info and register at https://www.securityindustry.org/advance.
113
+
114
+ The Advance is co-located with ISC West, produced by ISC Security Events. Security professionals can register to attend the ISC West trade show and conference, which runs April 10-13, at http://www.iscwest.com.
115
+ ```
116
+
117
+ The model outputs one of the 11 type-of-speech classes as the predicted domain for each input sample. Example output:
118
+
119
+ ```
120
+ News
121
+ ```
122
+
123
+ # Software Integration
124
+ * Runtime Engine: Python 3.10 and NeMo Curator
125
+ * Supported Hardware Microarchitecture Compatibility: NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 (or above)
126
+ * Preferred/Supported Operating System(s): Ubuntu 22.04/20.04
127
+
128
+ # Training, Testing, and Evaluation Dataset
129
+ ## Training Data
130
+ * Link: Jigsaw Toxic Comments, Jigsaw Unintended Biases Dataset, Toxigen Dataset, Common Crawl, Wikipedia
131
+ * Data collection method by dataset
132
+   * Downloaded
133
+ * Labeling method by dataset
134
+   * Human
135
+ * Properties:
136
+   * 25,000 Common Crawl samples were labeled by an external vendor. Each sample was labeled by three annotators.
137
+   * The model is trained on the 19,604 samples for which at least two annotators agreed on the label.
138
+
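The card states that each sample has three annotations and that only samples with at least two agreeing annotators were kept, but it does not describe the aggregation procedure itself. The sketch below shows one plausible way to implement such a filter with a simple majority vote; `majority_label` and its `min_agree` parameter are hypothetical names, not the authors' code.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(votes: Sequence[Hashable], min_agree: int = 2) -> Optional[Hashable]:
    """Return the most common label if at least `min_agree` annotators chose it, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


# Two of three annotators agree, so the sample is kept with label "News".
kept = majority_label(["News", "News", "Blogs"])
# All three disagree, so the sample would be dropped.
dropped = majority_label(["News", "Blogs", "Reviews"])
# Requiring unanimous agreement corresponds to the stricter evaluation subset.
unanimous = majority_label(["News", "News", "Blogs"], min_agree=3)
print(kept, dropped, unanimous)
```

Setting `min_agree=3` mirrors the stricter subset in which all three annotators agree.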
139
+ Label distribution:
140
+
141
+ | Category | Count |
142
+ |----------------------------------------|-------|
143
+ | Product/Company/Organization/Personal Websites | 5227 |
144
+ | Blogs | 4930 |
145
+ | News | 2933 |
146
+ | Explanatory Articles | 2457 |
147
+ | Analytical Exposition | 1508 |
148
+ | Online Comments | 982 |
149
+ | Reviews | 512 |
150
+ | Boilerplate Content | 475 |
151
+ | MISC | 267 |
152
+ | Books and Literature | 164 |
153
+ | Conversational | 149 |
154
+
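The label distribution is noticeably imbalanced, which helps when interpreting the per-class scores in the evaluation section. As a small illustration, the sketch below copies the counts from the table and computes each class's share of the training set; the variable names are chosen here for illustration.

```python
# Counts copied from the label distribution table above.
label_counts = {
    "Product/Company/Organization/Personal Websites": 5227,
    "Blogs": 4930,
    "News": 2933,
    "Explanatory Articles": 2457,
    "Analytical Exposition": 1508,
    "Online Comments": 982,
    "Reviews": 512,
    "Boilerplate Content": 475,
    "MISC": 267,
    "Books and Literature": 164,
    "Conversational": 149,
}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
# Ratio between the most and least frequent class as a rough imbalance indicator.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

print(f"total samples: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The total matches the 19,604 training samples stated in the properties list.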
155
+ ## Evaluation
156
+ * Metric: PR-AUC
157
+
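The card does not say how PR-AUC was computed. One common estimator of the area under the precision-recall curve is average precision, computed per class in a one-vs-rest fashion. The sketch below is a plain-Python illustration of that estimator under this assumption; `average_precision` is a hypothetical helper, and its handling of tied scores may differ from library implementations.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision for one class: mean of precision@rank over the positive samples."""
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Toy example: positives are ranked 1st and 3rd out of four samples.
demo_scores = [0.9, 0.8, 0.7, 0.6]
demo_labels = [1, 0, 1, 0]
demo_ap = average_precision(demo_scores, demo_labels)
print(round(demo_ap, 4))
```

For a multi-class model, the scores for one class would be that class's predicted probabilities, and the labels would be 1 where the true class matches and 0 otherwise.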
158
+ Cross-validation PR-AUC on the 19,604 samples:
159
+
160
+ ```
161
+ Produ=0.697, Expla=0.668, News=0.859, Blogs=0.872, MISC=0.593, Boile=0.383, Analy=0.371, Onlin=0.753, Revie=0.612, Books=0.462, Conve=0.541, avg AUC=0.6192, accuracy=0.6805
162
+ ```
163
+
164
+ Cross-validation PR-AUC on the subset of 7,738 samples for which all three annotators agreed:
165
+
166
+ ```
167
+ Produ=0.869, Expla=0.854, News=0.964, Blogs=0.964, MISC=0.876, Boile=0.558, Analy=0.334, Onlin=0.893, Revie=0.825, Books=0.780, Conve=0.793, avg AUC=0.7917, accuracy=0.8444
168
+ ```
169
+
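The reported `avg AUC` values appear to be the unweighted (macro) mean of the 11 per-class scores. The sketch below parses the two result lines as printed above and recomputes that mean as a sanity check; `parse_metrics` and `macro_average` are hypothetical helpers, and small differences are expected because the per-class values are rounded to three decimals.

```python
from typing import Dict

ALL_SAMPLES = (
    "Produ=0.697, Expla=0.668, News=0.859, Blogs=0.872, MISC=0.593, Boile=0.383, "
    "Analy=0.371, Onlin=0.753, Revie=0.612, Books=0.462, Conve=0.541, "
    "avg AUC=0.6192, accuracy=0.6805"
)
UNANIMOUS_SUBSET = (
    "Produ=0.869, Expla=0.854, News=0.964, Blogs=0.964, MISC=0.876, Boile=0.558, "
    "Analy=0.334, Onlin=0.893, Revie=0.825, Books=0.780, Conve=0.793, "
    "avg AUC=0.7917, accuracy=0.8444"
)
SUMMARY_KEYS = ("avg AUC", "accuracy")


def parse_metrics(line: str) -> Dict[str, float]:
    """Turn 'name=value, name=value, ...' into a dict of floats."""
    pairs = (item.split("=") for item in line.split(", "))
    return {name.strip(): float(value) for name, value in pairs}


def macro_average(metrics: Dict[str, float]) -> float:
    """Unweighted mean over the per-class entries, ignoring the summary fields."""
    per_class = [v for k, v in metrics.items() if k not in SUMMARY_KEYS]
    return sum(per_class) / len(per_class)


for name, line in (("all samples", ALL_SAMPLES), ("unanimous subset", UNANIMOUS_SUBSET)):
    metrics = parse_metrics(line)
    print(f"{name}: recomputed={macro_average(metrics):.4f}, reported={metrics['avg AUC']}")
```

A macro average weights rare classes such as Conversational equally with frequent ones such as Blogs, which is worth keeping in mind when comparing it with the accuracy figure.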
170
+ # Inference
171
+ * Engine: PyTorch
172
+ * Test Hardware: V100
173
+
174
+ # Ethical Considerations
175
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
176
+
177
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability).