Dark Web ChatGPT Unleashed: Meet DarkBERT

We are in the early stages of the snowball effect caused by the emergence of Large Language Models (LLMs) like ChatGPT. Coupled with the open-sourcing of other GPT (Generative Pre-Trained Transformer) models, the number of applications adopting AI is exploding. And as we already know, ChatGPT itself can be used to create advanced malware.

Over time, more and more LLMs are appearing, each specialized in its own domain and trained on carefully curated data for a specific purpose. One such application has now arrived: a model trained on data from the dark web itself. DarkBERT, as its South Korean creators call it, is here. Follow the link to the release paper for a solid introduction to the dark web itself.

DarkBERT is based on the RoBERTa architecture, an AI approach developed back in 2019. RoBERTa has seen a comeback of sorts: researchers have since found that it had more performance to give than was extracted from it at the time, because the model was significantly undertrained at release, well below its maximum efficiency.
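
For readers who want to experiment with the underlying architecture, RoBERTa is available through the Hugging Face transformers library. Below is a minimal sketch that instantiates the public roberta-base checkpoint; it is only a stand-in for the architecture DarkBERT builds on, not the DarkBERT weights themselves.

```python
# Minimal sketch: instantiating the RoBERTa architecture that DarkBERT builds
# on, using the Hugging Face transformers library. "roberta-base" is the
# public 2019 checkpoint and stands in for whatever weights are actually
# trained or loaded on top of this architecture.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

print(config.num_hidden_layers, config.hidden_size)           # 12, 768
print(sum(p.numel() for p in model.parameters()) / 1e6, "M")  # ~125M parameters
```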

To train the model, the researchers crawled the dark web through the Tor network’s anonymizing firewall and filtered the raw data (applying techniques such as deduplication, category balancing, and preprocessing) to generate a dark web database. DarkBERT is the result of feeding that database to the RoBERTa large language model, yielding a model that can analyze new dark web content, written in its own dialects and heavily coded messages, and extract useful information from it.
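
As a rough illustration of that filtering stage, the sketch below deduplicates crawled pages by content hash, balances categories by capping each one, and applies light text preprocessing. The field names, cap, and balancing rule are assumptions made for illustration, not the researchers' actual pipeline.

```python
# Illustrative sketch of the filtering described above: deduplication,
# category balancing, and light preprocessing of crawled pages.
# Field names and the per-category cap are assumptions, not the paper's pipeline.
import hashlib
import random
import re
from collections import defaultdict

def preprocess(text: str) -> str:
    """Collapse whitespace and strip obvious noise before hashing/training."""
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(pages, per_category_cap=10_000, seed=0):
    seen_hashes = set()
    by_category = defaultdict(list)

    for page in pages:  # each page: {"text": ..., "category": ...}
        text = preprocess(page["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # deduplication by content hash
            continue
        seen_hashes.add(digest)
        by_category[page["category"]].append(text)

    # Category balancing: cap over-represented categories so no single
    # type of content dominates the training corpus.
    rng = random.Random(seed)
    corpus = []
    for category, texts in by_category.items():
        rng.shuffle(texts)
        corpus.extend(texts[:per_category_cap])
    return corpus
```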

It may not be entirely correct to say that English is the business language of the dark web, but the mix used there is distinctive enough that the researchers believed a dedicated LLM had to be trained on it. They were right: the researchers showed that DarkBERT outperforms other large language models on dark web text. This should allow security researchers and law enforcement to penetrate deeper into this corner of the web. After all, that is where most of the action is.
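
One way to get a feel for such a comparison is to probe how a general-purpose RoBERTa model and a domain-adapted checkpoint fill in masked tokens in dark-web-style sentences. The snippet below is a hypothetical harness; the second model path is a placeholder for whatever DarkBERT weights a researcher might have access to, not an official identifier.

```python
# Hypothetical comparison: how differently a generic RoBERTa model and a
# dark-web-adapted checkpoint fill in masked tokens in domain-specific text.
# "path/to/darkbert-checkpoint" is a placeholder, not an official model ID.
from transformers import pipeline

prompts = [
    "The vendor posted a new <mask> on the marketplace forum.",
    "Payment was requested in <mask> to keep the transaction untraceable.",
]

for name in ["roberta-base", "path/to/darkbert-checkpoint"]:
    try:
        fill = pipeline("fill-mask", model=name)
    except Exception:
        print(f"{name}: checkpoint not available, skipping")
        continue
    for prompt in prompts:
        top = fill(prompt, top_k=3)
        print(name, prompt, [candidate["token_str"] for candidate in top])
```

A domain-adapted model would be expected to prefer dark-web vocabulary in these slots, which is the kind of behavioural gap the researchers' evaluation measures more rigorously.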

As with any other LLM, this does not mean DarkBERT is finished; further training and tuning can continue to improve its results. How it will be used, and what knowledge it can help gather, remains to be seen.
