add data preview and environment set-up notebook

buzz-lightsnack-2007 2024-08-22 11:16:29 +00:00
parent f14ee25d5b
commit c034b70c9a

257
preparation.ipynb Normal file

@@ -0,0 +1,257 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# An Evaluation on the Effectiveness of Large Language Models in Self-Detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This paper investigates and evaluates the self-detection abilities of large language models through their architecture and their execution performance. This Jupyter notebook, in particular, will contain the experimental data used to evaluate that performance. \n",
"\n",
"This notebook contains the scripts needed to configure the testing environment; it does not perform the tests itself. The tests are run in the notebooks within the `tests/` directory. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This section installs the dependencies required by the tests. All or most tests need the following: \n",
"- gpt4all\n",
"- kaggle\n",
"- kagglehub\n",
"- pandas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pip in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 1)) (24.2)\n",
"Requirement already satisfied: kaggle in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 2)) (1.6.17)\n",
"Requirement already satisfied: kagglehub in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 3)) (0.2.9)\n",
"Requirement already satisfied: gpt4all in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 4)) (2.8.2)\n",
"Requirement already satisfied: pandas in /home/codespace/.local/lib/python3.12/site-packages (from -r dependencies.txt (line 5)) (2.2.2)\n",
"Requirement already satisfied: six>=1.10 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (1.16.0)\n",
"Requirement already satisfied: certifi>=2023.7.22 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2024.7.4)\n",
"Requirement already satisfied: python-dateutil in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.9.0.post0)\n",
"Requirement already satisfied: requests in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.32.3)\n",
"Requirement already satisfied: tqdm in /home/codespace/.python/current/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (4.66.5)\n",
"Requirement already satisfied: python-slugify in /home/codespace/.python/current/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (8.0.4)\n",
"Requirement already satisfied: urllib3 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.2.2)\n",
"Requirement already satisfied: bleach in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (6.1.0)\n",
"Requirement already satisfied: packaging in /home/codespace/.local/lib/python3.12/site-packages (from kagglehub->-r dependencies.txt (line 3)) (24.1)\n",
"Requirement already satisfied: numpy>=1.26.0 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2.0.1)\n",
"Requirement already satisfied: pytz>=2020.1 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2024.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2024.1)\n",
"Requirement already satisfied: webencodings in /home/codespace/.local/lib/python3.12/site-packages (from bleach->kaggle->-r dependencies.txt (line 2)) (0.5.1)\n",
"Requirement already satisfied: text-unidecode>=1.3 in /home/codespace/.python/current/lib/python3.12/site-packages (from python-slugify->kaggle->-r dependencies.txt (line 2)) (1.3)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /home/codespace/.local/lib/python3.12/site-packages (from requests->kaggle->-r dependencies.txt (line 2)) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /home/codespace/.local/lib/python3.12/site-packages (from requests->kaggle->-r dependencies.txt (line 2)) (3.7)\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade -r dependencies.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Fetch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this study, the dataset from [Thite (2023)](https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset) is used because it is sufficiently large and fits within the technical limitations of the study. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code block below imports all the modules needed for this section. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import kagglehub\n",
"import json\n",
"import os\n",
"import pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below sets up a dictionary that holds references to the dataset and model files. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"FILES = {\"datasets\": {}, \"models\": {}}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Attempting to download Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1\n",
"Finished downloading Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1\n",
"The file Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1 has been linked.\n"
]
}
],
"source": [
"def download_datasets():\n",
"    # Read the dataset listings (file name -> Kaggle dataset handle).\n",
"    with open(\"data/sources.json\") as SOURCES_FILE:\n",
"        SOURCES = json.load(SOURCES_FILE)\n",
"\n",
"    # Loop through each dataset target.\n",
"    for TARGET, HANDLE in SOURCES.items():\n",
"        print(\"Attempting to download \" + TARGET + \" from dataset \" + HANDLE)\n",
"\n",
"        # Download the dataset and keep the local directory that kagglehub returns.\n",
"        # An exception raised here stops the function, so the steps below only run on success.\n",
"        DOWNLOAD_PATH = kagglehub.dataset_download(HANDLE)\n",
"        print(\"Finished downloading \" + TARGET + \" from dataset \" + HANDLE)\n",
"\n",
"        # Link each file into data/ unless it is already there.\n",
"        LINK_PATH = os.path.join(\"data\", TARGET)\n",
"        try:\n",
"            if not os.path.isfile(LINK_PATH):\n",
"                os.symlink(os.path.join(DOWNLOAD_PATH, TARGET), LINK_PATH)\n",
"        except OSError:\n",
"            print(\"We weren't able to link the file \" + TARGET + \" from dataset \" + HANDLE + \". Please perform this manually.\")\n",
"        else:\n",
"            print(\"The file \" + TARGET + \" from dataset \" + HANDLE + \" has been linked.\")\n",
"            FILES[\"datasets\"][TARGET] = open(LINK_PATH)\n",
"\n",
"download_datasets()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, the downloaded datasets are loaded into dataframes for preview purposes. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"DATAFRAMES = {}\n",
"\n",
"def convert_datasets():\n",
"    # Load each opened dataset file into a pandas dataframe, keyed by file name.\n",
"    for DATASET_NAME, DATASET_FILE in FILES[\"datasets\"].items():\n",
"        DATAFRAMES[DATASET_NAME] = pandas.read_csv(DATASET_FILE)\n",
"\n",
"convert_datasets()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The datasets can be previewed below. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training_Essay_Data.csv\n",
" text generated\n",
"0 Car-free cities have become a subject of incre... 1\n",
"1 Car Free Cities Car-free cities, a concept ga... 1\n",
"2 A Sustainable Urban Future Car-free cities ... 1\n",
"3 Pioneering Sustainable Urban Living In an e... 1\n",
"4 The Path to Sustainable Urban Living In an ... 1\n",
"... ... ...\n",
"29140 There has been a fuss about the Elector Colleg... 0\n",
"29141 Limiting car usage has many advantages. Such a... 0\n",
"29142 There's a new trend that has been developing f... 0\n",
"29143 As we all know cars are a big part of our soci... 0\n",
"29144 Cars have been around since the 1800's and hav... 0\n",
"\n",
"[29145 rows x 2 columns]\n"
]
}
],
"source": [
"def preview_datasets():\n",
"    # Print each dataset name followed by its dataframe.\n",
"    for DATASET_NAME, DATAFRAME in DATAFRAMES.items():\n",
"        print(DATASET_NAME)\n",
"        print(DATAFRAME)\n",
"\n",
"preview_datasets()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
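
The download cell in the notebook above reads `data/sources.json` and links each downloaded file into `data/`. The sketch below illustrates that flow offline, under several assumptions: the shape of `sources.json` (file name mapped to dataset handle) is inferred from the messages the notebook prints, the helper `link_dataset_file` is a hypothetical name, a temporary directory stands in for the directory that `kagglehub.dataset_download` would return, and the CSV written here is a placeholder rather than real data.

```python
import json
import os
import tempfile


def link_dataset_file(download_dir: str, data_dir: str, target: str) -> str:
    """Symlink download_dir/target into data_dir and return the link path.

    Hypothetical helper; an existing link or file is left untouched.
    """
    link_path = os.path.join(data_dir, target)
    if not os.path.lexists(link_path):
        os.symlink(os.path.join(download_dir, target), link_path)
    return link_path


# Assumed content of data/sources.json, inferred from the notebook output.
SOURCES_JSON = (
    '{"Training_Essay_Data.csv": '
    '"sunilthite/llm-detect-ai-generated-text-dataset/versions/1"}'
)
sources = json.loads(SOURCES_JSON)

with tempfile.TemporaryDirectory() as root:
    # Stand-in for the local cache directory of a downloaded dataset.
    download_dir = os.path.join(root, "cache")
    data_dir = os.path.join(root, "data")
    os.makedirs(download_dir)
    os.makedirs(data_dir)

    linked = {}
    for target, handle in sources.items():
        # Write a tiny placeholder CSV instead of downloading anything.
        with open(os.path.join(download_dir, target), "w") as placeholder:
            placeholder.write("text,generated\nplaceholder essay,1\n")
        linked[target] = link_dataset_file(download_dir, data_dir, target)

    # Read each file back through its link to confirm the link resolves.
    for target, path in linked.items():
        with open(path) as linked_file:
            header = linked_file.readline().strip()
        print(target, "->", header)
```

Linking from the directory returned by the download call, rather than from a hard-coded cache path, keeps the step working if the cache location differs between machines.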