Nirmol is an open-source dataset and API for detecting Bangla slang words. Detect offensive/bad/slang words in Bangla/Bengali/Banglish sentences. A helpful API and dataset for developers and researchers.
Nirmol (নির্মল) is a microservice-based offensive language detection API. It detects offensive/bad/slang words in Bangla/Bengali/Banglish sentences. You can set up or host the API on any Node.js server.
Nirmol: Keeping Bangla Online Conversations Clean and Respectful
📑Documentation: Nirmol Doc
📹Project overview: YouTube
You can download the dataset from the GitHub repository or from the Direct dataset link, and use it for ML and AI model training.
Nirmol API is based on:
npm package used
Step 1: Clone the Nirmol repository
git clone https://github.com/Sigmakib2/Nirmol.git
Step 2: Go to the Nirmol directory
cd Nirmol
Step 3: Install node modules
npm install
Step 4: Start the project
npm start
Then open your web browser and navigate to http://localhost:3000. You should see "Cannot GET /" displayed on the page. To test the API, you have to enter a sentence after the '/', for example "http://localhost:3000/hello world".
The API endpoint analyzes a sentence for offensive/slang words and provides additional information about the sentence.
For example, here is the response to a GET request:
{
  "bad_sentence": true,
  "bad_word_list": [
    "word 1",
    "word 2"
  ],
  "normal_words": [
    "word 1",
    "word 2",
    "word 3"
  ],
  "badness": "16.67%"
}
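A client for the GET endpoint could be sketched as follows. This is a minimal sketch, not part of the repository: it assumes the API is running locally on port 3000 as in the setup steps, and the helper names (buildGetUrl, parseBadness, checkSentence) are hypothetical.

```javascript
// Assumption: the API runs locally on port 3000, as in the setup steps.
const BASE_URL = "http://localhost:3000";

// Build the GET URL for a sentence. encodeURIComponent escapes spaces
// and other reserved characters before they are placed in the path.
function buildGetUrl(sentence, baseUrl = BASE_URL) {
  return `${baseUrl}/${encodeURIComponent(sentence)}`;
}

// Convert a badness string such as "16.67%" into the number 16.67.
function parseBadness(badness) {
  return Number.parseFloat(badness);
}

// Send the GET request and return the parsed JSON body.
// Requires Node.js 18+ (global fetch) or a browser.
async function checkSentence(sentence) {
  const response = await fetch(buildGetUrl(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

With the server running, `checkSentence("hello world")` should resolve to an object with the fields shown above (bad_sentence, bad_word_list, normal_words, badness).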
You can also use the POST method to get a response. This feature was added by Tasnim Anas.
For a POST request, the endpoint is "http://localhost:3000/" and you have to send the payload in the body like this:
{
  "sentence": "Your sentence here..."
}
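The POST request described above might be sent like this. It is a sketch under assumptions: the "Content-Type: application/json" header is assumed (the source only shows the JSON payload), and the helper names are hypothetical.

```javascript
// Endpoint taken from the text above; adjust it if you host the API elsewhere.
const API_URL = "http://localhost:3000/";

// Build the fetch options for a POST request carrying the sentence.
// Assumption: the server parses a JSON body with a "sentence" field.
function buildPostOptions(sentence) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sentence }),
  };
}

// Send the POST request and return the parsed JSON body.
// Requires Node.js 18+ (global fetch) or a browser.
async function checkSentenceByPost(sentence) {
  const response = await fetch(API_URL, buildPostOptions(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```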
Here's what the response means:
The API ignores special symbols such as #, !, and @. Many people on the internet insert these symbols into slang words, and AI systems often fail to detect the disguised words. For example, "Hello World" can be written as "He#ll@ W@rl#d", which is difficult for many AI systems to detect. Nirmol takes a simple approach: when a word contains special symbols, the API ignores them and then checks the word.
This API also ignores emojis🥳
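The symbol and emoji handling described above can be illustrated with a short sketch. It shows one possible approach using Unicode property escapes and is not necessarily how the repository's index.js does it; stripSymbols and findBadWords are hypothetical names, and the word list passed in is a placeholder.

```javascript
// Remove every character that is not a letter, combining mark, digit,
// or whitespace. \p{M} keeps Bangla vowel signs attached to their
// letters; the "u" flag enables Unicode property escapes.
function stripSymbols(text) {
  return text.replace(/[^\p{L}\p{M}\p{N}\s]/gu, "");
}

// Split a sentence on whitespace, strip symbols from each token, and
// return the tokens that appear in the given bad-word list.
function findBadWords(sentence, badWords) {
  const badSet = new Set(badWords.map((word) => word.toLowerCase()));
  return sentence
    .split(/\s+/)
    .map(stripSymbols)
    .filter((word) => word.length > 0 && badSet.has(word.toLowerCase()));
}

console.log(stripSymbols("He#ll@ W@rl#d")); // prints "Hell Wrld"
```

Note that stripping only removes the symbols; it does not restore letters that a symbol replaced, so the cleaned word is matched as-is against the list.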
There are some words in Bangla that work as prefixes or suffixes and make other words toxic. You can include them in the prefixes_suffixes.json file. The API finds words in a sentence that have any of these entries as a prefix or suffix and declares the whole word a negative word.
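The prefix and suffix check can be sketched as below. This is an illustration under assumptions: the layout of prefixes_suffixes.json is not shown in this README, so the sketch takes a flat array of strings, and it assumes an entry only flags a word that is longer than the entry itself. The function names and the placeholder affix "foo" are hypothetical.

```javascript
// Return true when the word starts or ends with one of the affixes.
// Assumption: a word identical to the affix is not flagged, since the
// affix is described as making *other* words toxic.
function hasToxicAffix(word, affixes) {
  return affixes.some(
    (affix) =>
      word.length > affix.length &&
      (word.startsWith(affix) || word.endsWith(affix))
  );
}

// Return every word in the sentence that carries a listed affix.
function flagWordsByAffix(sentence, affixes) {
  return sentence
    .split(/\s+/)
    .filter((word) => word.length > 0 && hasToxicAffix(word, affixes));
}

// Placeholder affix list; real entries would come from prefixes_suffixes.json.
console.log(flagWordsByAffix("a foobar barfoo foo baz", ["foo"]));
// prints [ 'foobar', 'barfoo' ]
```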
You cannot put a "/" symbol in the sentence when you are using the GET method. For example, suppose you have a text area where someone writes "Hello world/earth" and you send the input value without any validation or sanitization. The request then fails with an error like "Cannot GET /hello%20world/earth". Use the POST method for such input.
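A client can apply this rule automatically by switching to POST whenever the sentence contains a "/". The following is a hedged sketch: buildRequest is a hypothetical helper, the base URL is the local one from the setup steps, and the JSON content type is an assumption.

```javascript
// Decide which HTTP method to use for a sentence and return the URL
// and fetch options. Sentences containing "/" go through POST because
// the GET route cannot carry them.
function buildRequest(sentence, baseUrl = "http://localhost:3000") {
  if (sentence.includes("/")) {
    return {
      url: `${baseUrl}/`,
      options: {
        method: "POST",
        headers: { "Content-Type": "application/json" }, // assumed header
        body: JSON.stringify({ sentence }),
      },
    };
  }
  return {
    url: `${baseUrl}/${encodeURIComponent(sentence)}`,
    options: { method: "GET" },
  };
}
```

The returned pair can be passed straight to fetch: `fetch(request.url, request.options)`.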
Suppose you have your list of offensive/bad/slang words. You want to add them to your API. Then how can you do that? Here in this repository, you can find the solution. After cloning the project you will see 3 files: input.txt, nirmol.json, and txt-2-nirmol.js.
.gitignore
index.js
🟡-> input.txt
🟢-> nirmol.json
nirmol.png
package-lock.json
package.json
prefixes_suffixes.json
README.md
tree.txt
🔴 -> txt-2-nirmol.js
+---datasets
| Nirmol-v1-dataset.csv
|
\---node_modules
Here the input.txt file contains all the offensive/bad/slang words available in the dataset. The nirmol.json file contains the same data in structured form, and txt-2-nirmol.js is the script that converts input.txt into nirmol.json.
Step 1: Add your words to input.txt
Step 2: Run the conversion script
node txt-2-nirmol.js
Step 3: Check the generated JSON file (nirmol.json) to ensure that it reflects the changes made to the text file.
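To give an idea of what such a conversion involves, here is a minimal sketch of a text-to-JSON script. It is not the repository's txt-2-nirmol.js: it assumes one word per line in the text file and a flat JSON array as output, and the actual file formats may differ. The function names are hypothetical.

```javascript
const fs = require("node:fs");

// Turn a block of text into a JSON array string.
// Assumption: one word per line; blank lines and duplicates are dropped.
function wordsToJson(text) {
  const words = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const uniqueWords = [...new Set(words)];
  return JSON.stringify(uniqueWords, null, 2);
}

// Read the text file, convert it, and write the JSON file.
function convertFile(inputPath, outputPath) {
  const text = fs.readFileSync(inputPath, "utf8");
  fs.writeFileSync(outputPath, wordsToJson(text), "utf8");
}
```

Calling `convertFile("input.txt", "nirmol.json")` would overwrite the JSON file, so try it on copies first if you experiment with this sketch.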