With the 2024 elections coming up, spam and political texts are more prevalent than ever as campaigns increasingly turn to texting potential voters. Over 15 billion political texts were sent in 2022 alone, and that number has only been rising. The constant messaging gets annoying fast, and it has led many of us to ask, "Can't we block these texts with AI?"
So, what does it really take to detect and block spam? It seems like there should be several public datasets of spam to train on, but surprisingly, there are very few. The only dataset that comes up is the SMS Spam Collection, which we've nicknamed SpamOrHam. However, it's over a decade old and doesn't contain examples of political messaging.
So why not use our own spam texts? In the past, a handful of data points wasn't enough to train a model, but by prompting Llama 3.1 405B with only 5 examples of political spam and a few examples of other types of texts, we were able to create a diverse synthetic dataset. That way, anyone can replicate this experiment.
This is the first post in a 4-part blog series. In the next three, we will cover:
- How to De-duplicate and Clean Synthetic Data [2/4]
- Fine-Tuning Llama 3.1 8B in Under 12 Minutes [3/4]
- Evaluating the Trained Model [4/4]
Download the Code!
To follow along with this post in code, you can get the notebook from the Oxen repo:
Getting Started
To get started generating the data, we need a few example messages for each category, and you can use your own. We used 5 examples of political spam, 4 examples of regular spam, and 12 examples of legitimate text messages, but as long as you have at least a few of each, it will work.
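For reference, the seed data can be as simple as a few Python lists. The messages below are illustrative placeholders, not our actual examples:

```python
# Hand-written seed examples; swap in your own messages.
political_spam_examples = [
    "It's JOHN. Polls close TOMORROW and we're still $2,000 short. Can you chip in $5 right now?",
    "Your ballot status: NOT RETURNED. Reply YES and we'll text you your polling location.",
]
spam_examples = [
    "You've been selected for a $500 gift card! Click here to claim your prize.",
    "FINAL NOTICE: your car warranty is about to expire. Call now to renew.",
]
not_spam_examples = [
    "Hey, are we still on for dinner at 7?",
    "Running 10 min late, sorry! Grab us a table?",
]
```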
Setting up the Model
Next, we need to set up a model to generate synthetic data with. We used the Fireworks API:
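The exact client setup is in the notebook; a minimal sketch, assuming you use Fireworks' OpenAI-compatible endpoint and have a `FIREWORKS_API_KEY` environment variable set (the model id below should be checked against the Fireworks model catalog), looks something like this:

```python
import os
from openai import OpenAI  # pip install openai

# Fireworks exposes an OpenAI-compatible API, so we can reuse the OpenAI client.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Llama 3.1 405B Instruct hosted on Fireworks.
MODEL = "accounts/fireworks/models/llama-v3p1-405b-instruct"
```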
We also need system prompts. Using some basic prompt engineering, we came up with these:
The system prompt is exactly what it sounds like: the system message passed to the model. The user prompt, on the other hand, is appended to the end of the list of examples to form the full user message passed to the model.
To see the full prompts, you can look at the notebook, but here is a sample:
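To give a rough idea of the shape, here is a paraphrased version (not the exact wording from the notebook):

```python
# Paraphrased prompts; the exact wording lives in the notebook.
political_spam_system_prompt = (
    "You are generating realistic SMS messages from political campaigns to potential voters. "
    "Given a few example texts, write new messages that vary in tone, topic, and sender. "
    "Return one message per line with no numbering or extra commentary."
)

user_prompt = "Here are some examples. Generate 20 new text messages in the same style."
```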
You can also download the notebook to try modifying the prompts and experimenting with how they affect the data generation.
Generating the Data
Finally, we can run the model!
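Roughly, the generation loop looks like the sketch below, reusing the client, model id, prompts, and example lists from above. The batch count and temperature are assumptions; tune them to taste.

```python
def generate_messages(system_prompt, examples, user_prompt, n_batches=10):
    """Ask the model for several batches of new messages and collect them into one list."""
    generated = []
    for _ in range(n_batches):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                # The examples plus the user prompt form the full user message.
                {"role": "user", "content": "\n".join(examples) + "\n\n" + user_prompt},
            ],
            temperature=1.0,
        )
        # The prompt asks for one message per line, so split on newlines.
        lines = response.choices[0].message.content.splitlines()
        generated.extend(line.strip() for line in lines if line.strip())
    return generated

political_spam = generate_messages(
    political_spam_system_prompt, political_spam_examples, user_prompt
)
```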
Repeat this for the other message types and you will end up with three lists: political spam, regular spam, and legitimate text messages.
Processing the Data
Let's save the data as a single parquet file.
Because of the nature of how the data was generated, it's likely that there are duplicates in the data. In the future, we will want to remove them, but for now we just need to save the data.
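A simple way to do that is to tag each message with its category in a pandas DataFrame and write it out. Here we assume `political_spam`, `spam`, and `not_spam` are the three generated lists from the previous step, and the column names are our own choice, not anything required:

```python
import pandas as pd  # pip install pandas pyarrow

# Label each message with its category, then write everything to one parquet file.
rows = (
    [{"text": t, "category": "political_spam"} for t in political_spam]
    + [{"text": t, "category": "spam"} for t in spam]
    + [{"text": t, "category": "not_spam"} for t in not_spam]
)
df = pd.DataFrame(rows)
df.to_parquet("texts.parquet", index=False)
```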
Saving the Data
To make this easy, let's save the current data to an Oxen repo. This has the added benefit that we can revert to any past version if we later decide to undo an update to the data.
If you don't have an account, you can make one for free.
Creating a Repo
One easy way to create a repo is by using the Oxen Hub interface:
From there, we can clone the repo using `oxen clone` in a terminal and move texts.parquet into the cloned repo. If you don't already have the Oxen command line tool, it's very simple to install.
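For example, with your own repo URL in place of the placeholder below:

oxen clone https://hub.oxen.ai/your-username/your-repo-name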
Committing the Data
After that, it's as simple as a couple terminal commands:
oxen add texts.parquet
oxen commit -m "Adding data"
oxen push
What Next?
Now that we have this data, we can train models on it! But not so fast - the quality of the data is important. In the next post, we will discuss how to filter the data to increase model performance and reduce training costs.
Why Oxen?
Oxen.ai makes building, iterating on, and collaborating on machine learning datasets easy.
At its core, Oxen is a lightning-fast data version control tool optimized for large unstructured datasets. On top of that are features that make working with data easier, such as data diffs, natural language queries for tabular files, workspaces, rendering images in tables, and more. We're constantly pushing out new features to make things easier for you. Oh yeah, and it's open source.
If you would like to learn more, star us on GitHub or head to Oxen.ai and create an account.