write model testing section

2024-08-22 15:31:33 +00:00 · 2024-08-22 15:31:33 +00:00 · 6754eb8b0b
commit 6754eb8b0b
parent 3d6a1f6b42
2 changed files with 303 additions and 257 deletions
--- a/main.ipynb
+++ b/main.ipynb
--- a/preparation.ipynb
+++ b/preparation.ipynb
@ -1,257 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Evaluation on the Effectiveness of Large Language Models in Self-Detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This paper seeks to investigate and evaluate the self-detection abilities of language learning models through its architecture and its execution performance. This Jupyter notebook, in particular, will contain experimental data to evaluate the said performance. \n",
    "\n",
    "This notebook contains scripts needed for configuring the testing environment and does not perform the tests itself. This would be done in notebooks within the `tests/` directory. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section is where the dependencies required for all tests will be installed. For all or most tests, the following are needed: \n",
    "- gpt4all\n",
    "- kaggle\n",
    "- kagglehub\n",
    "- pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: pip in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 1)) (24.2)\n",
      "Requirement already satisfied: kaggle in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 2)) (1.6.17)\n",
      "Requirement already satisfied: kagglehub in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 3)) (0.2.9)\n",
      "Requirement already satisfied: gpt4all in /home/codespace/.python/current/lib/python3.12/site-packages (from -r dependencies.txt (line 4)) (2.8.2)\n",
      "Requirement already satisfied: pandas in /home/codespace/.local/lib/python3.12/site-packages (from -r dependencies.txt (line 5)) (2.2.2)\n",
      "Requirement already satisfied: six>=1.10 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (1.16.0)\n",
      "Requirement already satisfied: certifi>=2023.7.22 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2024.7.4)\n",
      "Requirement already satisfied: python-dateutil in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.9.0.post0)\n",
      "Requirement already satisfied: requests in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.32.3)\n",
      "Requirement already satisfied: tqdm in /home/codespace/.python/current/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (4.66.5)\n",
      "Requirement already satisfied: python-slugify in /home/codespace/.python/current/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (8.0.4)\n",
      "Requirement already satisfied: urllib3 in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (2.2.2)\n",
      "Requirement already satisfied: bleach in /home/codespace/.local/lib/python3.12/site-packages (from kaggle->-r dependencies.txt (line 2)) (6.1.0)\n",
      "Requirement already satisfied: packaging in /home/codespace/.local/lib/python3.12/site-packages (from kagglehub->-r dependencies.txt (line 3)) (24.1)\n",
      "Requirement already satisfied: numpy>=1.26.0 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2.0.1)\n",
      "Requirement already satisfied: pytz>=2020.1 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2024.1)\n",
      "Requirement already satisfied: tzdata>=2022.7 in /home/codespace/.local/lib/python3.12/site-packages (from pandas->-r dependencies.txt (line 5)) (2024.1)\n",
      "Requirement already satisfied: webencodings in /home/codespace/.local/lib/python3.12/site-packages (from bleach->kaggle->-r dependencies.txt (line 2)) (0.5.1)\n",
      "Requirement already satisfied: text-unidecode>=1.3 in /home/codespace/.python/current/lib/python3.12/site-packages (from python-slugify->kaggle->-r dependencies.txt (line 2)) (1.3)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /home/codespace/.local/lib/python3.12/site-packages (from requests->kaggle->-r dependencies.txt (line 2)) (3.3.2)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /home/codespace/.local/lib/python3.12/site-packages (from requests->kaggle->-r dependencies.txt (line 2)) (3.7)\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "pip install --upgrade -r dependencies.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Fetch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this study, the dataset from [Thite (2023)](https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset) will be employed due to its sufficient size and technical limitations. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code block below imports all the modules needed for this section. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kagglehub\n",
    "import json\n",
    "import os\n",
    "import pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below sets up a reference to the files. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "FILES = {\"datasets\": {}, \"models\": {}};"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Attempting to download Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1\n",
      "Finished downloading Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1\n",
      "The file Training_Essay_Data.csv from dataset sunilthite/llm-detect-ai-generated-text-dataset/versions/1 has been linked.\n"
     ]
    }
   ],
   "source": [
    "def download_datasets(): \n",
    "    # Read the dataset listings. \n",
    "    SOURCES = json.load(open(\"data/sources.json\"));\n",
    "\n",
    "    # Loop through each dataset target.\n",
    "    for TARGET in SOURCES.keys(): \n",
    "        print((\"Attempting to download \" + TARGET + \" from dataset \" + SOURCES[TARGET]));\n",
    "        \n",
    "        # Set the location of the Kaggle dataset. \n",
    "        kagglehub.dataset_download(SOURCES[TARGET]);\n",
    "        \n",
    "        # Do not continue when an error has occured. \n",
    "        print((\"Finished downloading \" + TARGET + \" from dataset \" + SOURCES[TARGET]));\n",
    "\n",
    "        # Link each file.\n",
    "        if (not(os.path.isfile((\"data/\" + TARGET))) and os.system((\"ln -s ~/.cache/kagglehub/datasets/\" + SOURCES[TARGET] + \"/\" + TARGET + \" data/\" + TARGET))): \n",
    "            print((\"We weren’t able to link the file \" + TARGET + \" from dataset \" + SOURCES[TARGET] + \". Please perform this manually.\"));\n",
    "        else: \n",
    "            print((\"The file \" + TARGET + \" from dataset \" + SOURCES[TARGET] + \" has been linked.\"));\n",
    "            FILES[\"datasets\"][TARGET] = open(\"data/\" + TARGET)\n",
    "\n",
    "download_datasets();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, the code blocks are converted into dataframes for preview purposes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATAFRAMES = {};\n",
    "\n",
    "def convert_datasets(): \n",
    "    if len(FILES[\"datasets\"].keys()): \n",
    "        for DATASET_NAME in FILES[\"datasets\"].keys(): \n",
    "            DATAFRAMES[DATASET_NAME] = pandas.read_csv(FILES[\"datasets\"][DATASET_NAME])\n",
    "\n",
    "convert_datasets();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The datasets could be previewed below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training_Essay_Data.csv\n",
      "                                                    text  generated\n",
      "0      Car-free cities have become a subject of incre...          1\n",
      "1      Car Free Cities  Car-free cities, a concept ga...          1\n",
      "2        A Sustainable Urban Future  Car-free cities ...          1\n",
      "3        Pioneering Sustainable Urban Living  In an e...          1\n",
      "4        The Path to Sustainable Urban Living  In an ...          1\n",
      "...                                                  ...        ...\n",
      "29140  There has been a fuss about the Elector Colleg...          0\n",
      "29141  Limiting car usage has many advantages. Such a...          0\n",
      "29142  There's a new trend that has been developing f...          0\n",
      "29143  As we all know cars are a big part of our soci...          0\n",
      "29144  Cars have been around since the 1800's and hav...          0\n",
      "\n",
      "[29145 rows x 2 columns]\n"
     ]
    }
   ],
   "source": [
    "def convert_datasets(): \n",
    "    if len(DATAFRAMES.keys()): \n",
    "        for DATASET_NAME in DATAFRAMES.keys(): \n",
    "            print(DATASET_NAME);\n",
    "            print(DATAFRAMES[DATASET_NAME]);\n",
    "\n",
    "convert_datasets();"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }