diff --git a/README.md b/README.md index e9d57f0..7c3e68f 100644 --- a/README.md +++ b/README.md @@ -105,6 +105,10 @@ pseudo_first_names = { } ``` +## Evaluation + +We evaluated the utility and performance of `mailcom` against other open-source pseudonymization tools on sample emails and public benchmark datasets. The evaluation details are summarized in `evaluation_strategy.md`. The corresponding evaluation scripts are available in the `docs/source/notebooks/quantitative_eval.ipynb` and `docs/source/notebooks/test_other_tools.ipynb` notebooks. + ## Citation To reference the `mailcom` package in any publication, please use the information provided in the [citation file](CITATION.cff). diff --git a/docs/source/notebooks/quantitative_eval.ipynb b/docs/source/notebooks/quantitative_eval.ipynb new file mode 100644 index 0000000..f5bd2d6 --- /dev/null +++ b/docs/source/notebooks/quantitative_eval.ipynb @@ -0,0 +1,1598 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# Quantitative Evaluation for `mailcom`" + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "## Install packages if needed" + ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "Recommended: create a clean environment and install the necessary packages." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "# install Hugging Face datasets if needed\n", + "%pip install datasets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "# install mailcom/refine_email_checking if needed\n", + "%pip install git+https://github.com/ssciwr/mailcom.git@refine_email_checking" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "## Import necessary packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "import json\n", + "\n", + "import mailcom\n", + "import pandas as pd\n", + "import ast\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "eval_folder = Path(\"./eval\")\n", + "eval_folder.mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "## Email address detection" + ] + }, + { + "cell_type": "markdown", + "id": "9", + "metadata": {}, + "source": [ + "### Prepare data" + ] + }, + { + "cell_type": "markdown", + "id": "10", + "metadata": {}, + "source": [ + "We used the Hugging Face `Josephgflowers/PII-NER` dataset for this evaluation.\n", + "\n", + "Since we only focus on email address detection, we first filtered the dataset to obtain a subset of sentences that contain email addresses. We then applied the `mailcom` transformation to these sentences and compared the detected email addresses with the ground-truth annotations provided in the dataset."
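, + "\n", + "\n", + "For scoring, every gold email is replaced with an `[email]` placeholder and placeholders are counted on both sides; a minimal sketch of this comparison (with hypothetical example strings):\n", + "\n", + "```python\n", + "gold = \"Contact [email] or [email] for details.\"  # ground truth\n", + "pred = \"Contact [email] or info@example.com for details.\"  # tool output\n", + "tp = min(pred.count(\"[email]\"), gold.count(\"[email]\"))  # -> 1\n", + "fp = max(pred.count(\"[email]\") - gold.count(\"[email]\"), 0)  # -> 0\n", + "fn = max(gold.count(\"[email]\") - pred.count(\"[email]\"), 0)  # -> 1\n", + "```"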
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11", + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset\n", + "pii_ner_ds = load_dataset(\"Josephgflowers/PII-NER\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12", + "metadata": {}, + "outputs": [], + "source": [ + "pii_ner_ds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13", + "metadata": {}, + "outputs": [], + "source": [ + "# create a copy of the dataset that contains only the original text and the extracted email addresses\n", + "\n", + "def extract_text_emails(row: dict):\n", + " try:\n", + " assistant_output = json.loads(row[\"assistant\"])\n", + " text = row[\"user\"]\n", + " emails = assistant_output.get(\"EMAIL\", [])\n", + " except Exception: # malformed annotation: keep the text with no emails\n", + " text = row[\"user\"]\n", + " emails = []\n", + " return {\"text\": text, \"emails\": emails}\n", + "\n", + "email_ds = pii_ner_ds[\"train\"].map(extract_text_emails, remove_columns=pii_ner_ds[\"train\"].column_names)\n", + "email_ds = email_ds.filter(lambda x: (len(x[\"emails\"]) > 0) and \"@\" in x[\"text\"])\n", + "email_ds # 2216 items" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14", + "metadata": {}, + "outputs": [], + "source": [ + "# no need to run this if using mailcom/refine_email_checking branch\n", + "# emails in this dataset are usually followed by a comma\n", + "# for simplicity, we added spaces before and after the email addresses to separate them from punctuation\n", + "def add_spaces_around_emails(text: str, emails: list):\n", + " for email in emails:\n", + " text = text.replace(email, f\" {email} \")\n", + " return text\n", + "\n", + "email_ds = email_ds.map(lambda x: {\"text\": add_spaces_around_emails(x[\"text\"], x[\"emails\"])})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15", + "metadata": {}, + "outputs": [], + "source": [ + "# Delete bad items from the dataset\n", + "# 3 items where email addresses have ’ (right single quotation mark, code point 8217) instead of ' (ASCII apostrophe, code point 39)\n", + "# 14 items where email addresses have a space in the middle\n", + "# 1 item where the email address is incorrect, \n", + "# -- text starts with \"As a resident of 707 Kade\", \n", + "# -- email \"nishithchawla@bhavsar-buch.com\" (the correct email is \"nsithchawla@bhavsar-buch.com\")\n", + "def is_valid_email(email: str, text: str):\n", + " invalid_emails = (\" \" in email) or\\\n", + " (any(ord(c) == 8217 for c in email)) or\\\n", + " (email == \"nishithchawla@bhavsar-buch.com\" and text.startswith(\"As a resident of 707 Kade\"))\n", + " if invalid_emails:\n", + " return False\n", + " return True\n", + "def is_valid_item(emails: list, text: str):\n", + " for email in emails:\n", + " if not is_valid_email(email, text):\n", + " return False\n", + " return True\n", + "\n", + "email_ds = email_ds.filter(lambda x: is_valid_item(x[\"emails\"], x[\"text\"]))\n", + "assert len(email_ds) == 2198, f\"Expected 2198 items after filtering, but got {len(email_ds)}\"\n", + "email_ds # 2198 items" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16", + "metadata": {}, + "outputs": [], + "source": [ + "# save the filtered dataset to a csv file for evaluation\n", + "email_ds_df = pd.DataFrame(email_ds)\n", + "email_ds_df[\"emails\"] = email_ds_df[\"emails\"].apply(json.dumps)\n", + "email_ds_df.to_csv(\"eval/email_detection_eval.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "17", + "metadata":
{}, + "source": [ + "### Run `mailcom` on the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18", + "metadata": {}, + "outputs": [], + "source": [ + "# load workflow configuration\n", + "new_settings = {\n", + " \"default_lang\": \"en\", # only English in the dataset\n", + " \"pseudo_fields\": [\"content\"], # we don't consider subject here\n", + " \"pseudo_emailaddresses\": True, # we only pseudo email addresses\n", + " \"pseudo_ne\": False,\n", + " \"pseudo_numbers\": False,\n", + " \"datetime_detection\": False,\n", + "}\n", + "\n", + "# save the updated configuration to a file for reproducibility purposes\n", + "new_settings_dir = eval_folder\n", + "workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings, \n", + " updated_setting_dir= new_settings_dir,\n", + " save_updated_settings=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19", + "metadata": {}, + "outputs": [], + "source": [ + "# load csv file into input handler\n", + "input_csv = \"eval/email_detection_eval.csv\"\n", + "# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline\n", + "matching_columns = [\"text\"]\n", + "# the predefined keys that should be used to match these columns, in the correct order\n", + "pre_defined_keys = [\"content\"]\n", + "# what to call any columns that are not matched to pre-defined keys\n", + "# get this from the workflow settings\n", + "unmatched_keyword = workflow_settings.get(\"csv_col_unmatched_keyword\")\n", + "\n", + "input_handler = mailcom.get_input_handler(in_path=input_csv, in_type=\"csv\", \n", + " col_names=matching_columns, \n", + " init_data_fields=pre_defined_keys, \n", + " unmatched_keyword=unmatched_keyword)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20", + "metadata": {}, + "outputs": [], + "source": [ + "# process the input data\n", + "mailcom.process_data(input_handler.get_email_list(), workflow_settings) # 22.3s on laptop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21", + "metadata": {}, + "outputs": [], + "source": [ + "# convert the processed data into a dataframe\n", + "email_df = pd.DataFrame(input_handler.get_email_list())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22", + "metadata": {}, + "outputs": [], + "source": [ + "# only keep the content, pseudo_content, and sentences columns\n", + "filtered_email_df = email_df[[\"content\", \"pseudo_content\", \"sentences\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23", + "metadata": {}, + "outputs": [], + "source": [ + "# add pseudo_content to the original dataset\n", + "org_email_df = pd.read_csv(input_csv)\n", + "# convert the emails column from string to list\n", + "org_email_df[\"emails\"] = org_email_df[\"emails\"].apply(ast.literal_eval)\n", + "\n", + "# check before merging\n", + "(org_email_df[\"text\"].sort_values().reset_index(drop=True) ==\n", + " filtered_email_df[\"content\"].sort_values().reset_index(drop=True)).all()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24", + "metadata": {}, + "outputs": [], + "source": [ + "# merge the pseudo_content into the original datasetn_df[\n", + "merged_email_df = org_email_df.merge(filtered_email_df, left_on=\"text\", right_on=\"content\", how=\"inner\", validate=\"one_to_one\")\n", + "\n", + "assert len(merged_email_df) == len(org_email_df) == len(filtered_email_df), f\"Expected unchanged number of 
items after merging, but got {len(merged_email_df)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25", + "metadata": {}, + "outputs": [], + "source": [ + "merged_email_df = merged_email_df.drop(columns=[\"content\"])\n", + "merged_email_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26", + "metadata": {}, + "outputs": [], + "source": [ + "prev_email_checking = False # set to True if using mailcom/main branch and refine_email_checking has not been merged yet\n", + "\n", + "# create expected content column as ground truth for evaluation\n", + "def clean_sentence(sentence, prev_email_checking=True):\n", + " if not prev_email_checking:\n", + " # no need to clean the sentence if using regex for email checking\n", + " return sentence\n", + " # normalize whitespace to make it match the way mailcom cleans the sentences\n", + " normalized_sentence = \" \".join(re.split(r\"\\s+\", sentence))\n", + " return normalized_sentence\n", + " \n", + "def replace_email_with_placeholder(sentences, email_list):\n", + " replaced_sents = []\n", + " for sentence in sentences:\n", + " replaced_sent = clean_sentence(sentence, prev_email_checking)\n", + " for email in email_list:\n", + " replaced_sent = replaced_sent.replace(email, \"[email]\")\n", + " replaced_sents.append(replaced_sent)\n", + " return \" \".join(replaced_sents)\n", + "\n", + "merged_email_df[\"expected_content\"] = merged_email_df.apply(lambda row: replace_email_with_placeholder(row[\"sentences\"].get(\"content\"), row[\"emails\"]), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27", + "metadata": {}, + "outputs": [], + "source": [ + "merged_email_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate the exact match accuracy between the pseudo_content and the expected_content\n", + "merged_email_df[\"exact_match\"] = merged_email_df.apply(lambda row: row[\"pseudo_content\"] == row[\"expected_content\"], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn\n", + "def calculate_tp_fp_fn(row):\n", + " pred_count = len(re.findall(r\"\\[email\\]\", row[\"pseudo_content\"]))\n", + " gold_count = len(re.findall(r\"\\[email\\]\", row[\"expected_content\"]))\n", + "\n", + " true_positives = min(pred_count, gold_count)\n", + " false_positives = max(pred_count - gold_count, 0)\n", + " false_negatives = max(gold_count - pred_count, 0)\n", + " \n", + " return pd.Series({\"tp\": true_positives, \"fp\": false_positives, \"fn\": false_negatives})\n", + "\n", + "metrics_df = merged_email_df.apply(calculate_tp_fp_fn, axis=1)\n", + "merged_email_df = pd.concat([merged_email_df, metrics_df], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30", + "metadata": {}, + "outputs": [], + "source": [ + "merged_email_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31", + "metadata": {}, + "outputs": [], + "source": [ + "# save the results to a csv file for further analysis\n", + "merged_email_df.to_csv(\"eval/email_detection_results.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(df):\n", + " total_tp = df[\"tp\"].sum()\n", + " total_fp = df[\"fp\"].sum()\n", + " total_fn =
df[\"fn\"].sum()\n", + "\n", + " precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0\n", + " recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0\n", + " f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0\n", + " \n", + " accuracy = df[\"exact_match\"].mean()\n", + " print(f\"Exact match accuracy: {accuracy:.4f}\")\n", + " print(f\"Micro Precision: {precision:.4f}\")\n", + " print(f\"Micro Recall: {recall:.4f}\")\n", + " print(f\"Micro F1 Score: {f1:.4f}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33", + "metadata": {}, + "outputs": [], + "source": [ + "cal_eval_metrics(merged_email_df)\n", + "# Results when using simple check \"@\" in sentence\n", + "# Exact match accuracy: 0.7302\n", + "# Micro Precision: 0.7741\n", + "# Micro Recall: 1.0000\n", + "# Micro F1 Score: 0.8727\n", + "# recall > precision, as mailcom also marks some mentioning form, e.g. @username as email addresses.\n", + "\n", + "# Results when using regex for email checking\n", + "# Exact match accuracy: 0.9995\n", + "# Micro Precision: 1.0000\n", + "# Micro Recall: 1.0000\n", + "# Micro F1 Score: 1.0000\n", + "# There is only one failed case:\n", + "# -- sentence: With an email address that reflects her professional prowess-bhamini.gulati@shukla.biz-she navigates the digital realm with ease.\n", + "# -- expected content: With an email address that reflects her professional prowess-[email]-she navigates the digital realm with ease.\n", + "# -- pseudo content: With an email address that reflects her professional [email]-she navigates the digital realm with ease." + ] + }, + { + "cell_type": "markdown", + "id": "34", + "metadata": {}, + "source": [ + "## NER detection" + ] + }, + { + "cell_type": "markdown", + "id": "35", + "metadata": {}, + "source": [ + "### Prepare data" + ] + }, + { + "cell_type": "markdown", + "id": "36", + "metadata": {}, + "source": [ + "We use Hugging Face `Babelscape/wikineural` dataset (`test_en`, `11,597` rows) for this evaluation.\n", + "\n", + "The dataset contains tokens and their corresponding NER tags.\n", + "\n", + "`{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}`\n", + "\n", + "We first converted the tokens into sentence and extract the named entities based on the NER tags. We then applied the `mailcom` transformation to these sentences and compared the detected named entities with the ground-truth annotations provided in the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37", + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset\n", + "ner_ds = load_dataset(\"Babelscape/wikineural\", split=\"test_en\")\n", + "ner_ds # 11597 items" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38", + "metadata": {}, + "outputs": [], + "source": [ + "# NER tags\n", + "ner_tags_def = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}\n", + "# convert key-value to value-key for easier mapping\n", + "ner_tags = {v: k for k, v in ner_tags_def.items()}\n", + "ner_tags" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39", + "metadata": {}, + "outputs": [], + "source": [ + "# map int in ner_tags to their corresponding string labels\n", + "def map_ner_tags(row):\n", + " ner_tags_int = row[\"ner_tags\"]\n", + " ner_tags_str = [ner_tags.get(tag, \"O\") for tag in ner_tags_int]\n", + " return {\"ner_tags_str\": ner_tags_str}\n", + "\n", + "ner_ds_mapped = ner_ds.map(map_ner_tags)\n", + "ner_ds_mapped" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40", + "metadata": {}, + "outputs": [], + "source": [ + "# regex to clean up spaces surrounding non-word chars\n", + "multiple_spaces = re.compile(r\"\\s+\")\n", + "\n", + "space_before_punc = re.compile(r\"\\s([?.!,;:%])\")\n", + "\n", + "space_after_open_bracket = re.compile(r\"([(\\[{])\\s\")\n", + "space_before_close_bracket = re.compile(r\"\\s([)\\]}])\")\n", + "\n", + "apostrophe_between_words = re.compile(r\"(\\w)\\s'\\s(\\w)\")\n", + "apostrophe_between_words_special = re.compile(r\"(\\w)\\s’\\s(\\w)\")\n", + "\n", + "space_in_quotes = re.compile(r'([\"“‘])\\s(.*?)\\s([\"”’])') # assuming that single quotes are used as apostrophes, not quotation marks\n", + "\n", + "# for each row, convert tokens into sentence or phrase and extract NE\n", + "def convert_tokens_to_phrase(tokens):\n", + " sentence = \" \".join(tokens)\n", + "\n", + " # post processing, order is important!\n", + " # replace multiple spaces with a single space\n", + " sentence = multiple_spaces.sub(\" \", sentence).strip()\n", + " # remove space before punctuations\n", + " sentence = space_before_punc.sub(r\"\\1\", sentence)\n", + " # fix space after opening brackets\n", + " sentence = space_after_open_bracket.sub(r\"\\1\", sentence)\n", + " # fix space before closing brackets\n", + " sentence = space_before_close_bracket.sub(r\"\\1\", sentence)\n", + " # fix apostrophes between words\n", + " sentence = apostrophe_between_words.sub(r\"\\1'\\2\", sentence)\n", + " sentence = apostrophe_between_words_special.sub(r\"\\1’\\2\", sentence)\n", + " # fix space in quotes\n", + " sentence = space_in_quotes.sub(r'\\1\\2\\3', sentence)\n", + " return sentence\n", + "\n", + "def get_token_offsets(sentence: str, tokens: list):\n", + " offsets = []\n", + " current_pos = 0\n", + " for token in tokens:\n", + " start_idx = sentence.find(token, current_pos)\n", + " if start_idx == -1:\n", + " raise ValueError(f\"Token '{token}' not found in sentence starting from position {current_pos}\")\n", + " end_idx = start_idx + len(token)\n", + " offsets.append((start_idx, end_idx))\n", + " current_pos = end_idx\n", + " return offsets\n", + "\n", + "def build_entity(tokens, entity_type, start, end):\n", + " return {\n", + " \"type\": entity_type,\n", + " \"text\": convert_tokens_to_phrase(tokens),\n", + " \"start\": start,\n", + " \"end\": end\n", + " }\n", + "\n", + "def 
convert_tokens_to_sentence_and_extract_ne(row):\n", + " tokens = row[\"tokens\"]\n", + " ner_tags_list = row[\"ner_tags_str\"]\n", + " \n", + " # convert tokens to sentence\n", + " sentence = convert_tokens_to_phrase(tokens)\n", + " \n", + " entities = [] # list of dict\n", + " current_entity = []\n", + " current_entity_type = None\n", + " token_offsets = get_token_offsets(sentence, tokens)\n", + " start_pos = None\n", + " end_pos = None\n", + "\n", + " def flush():\n", + " nonlocal current_entity, current_entity_type, start_pos, end_pos\n", + " if current_entity:\n", + " entities.append(build_entity(current_entity, current_entity_type, start_pos, end_pos))\n", + " current_entity = []\n", + " current_entity_type = None\n", + " \n", + " for token, tag, (start, end) in zip(tokens, ner_tags_list, token_offsets):\n", + " if tag == \"O\": # reset case\n", + " flush()\n", + " continue\n", + "\n", + " prefix, _, ent_type = tag.partition(\"-\")\n", + "\n", + " if prefix == \"B\":\n", + " flush() # store the previous entity before starting a new one\n", + " current_entity_type = ent_type\n", + " current_entity = [token]\n", + " start_pos = start\n", + " end_pos = end\n", + " continue\n", + "\n", + " if prefix == \"I\" and current_entity_type == ent_type:\n", + " # continue the current entity\n", + " current_entity.append(token)\n", + " end_pos = end\n", + " continue\n", + "\n", + " # handle unexpected cases (e.g., I- tag without a preceding B- tag or mismatched entity types)\n", + " flush()\n", + "\n", + " # check if there's an entity at the end of the sentence\n", + " flush()\n", + " \n", + " return {\"sentence\": sentence, \"entities\": entities}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41", + "metadata": {}, + "outputs": [], + "source": [ + "ner_ds_extracted = ner_ds_mapped.map(convert_tokens_to_sentence_and_extract_ne)\n", + "ner_ds_extracted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42", + "metadata": {}, + "outputs": [], + "source": [ + "# keep only the tokens, sentence, and entities columns\n", + "ner_ds_filtered = ner_ds_extracted.remove_columns([col for col in ner_ds_extracted.column_names if col not in [\"tokens\", \"sentence\", \"entities\"]])\n", + "ner_ds_filtered" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43", + "metadata": {}, + "outputs": [], + "source": [ + "# convert dataset to dataframe and serialize the list of dicts as JSON strings\n", + "# to make sure that there are commas between dicts after saving to csv\n", + "df_ner_ds = pd.DataFrame(ner_ds_filtered)\n", + "df_ner_ds[\"entities\"] = df_ner_ds[\"entities\"].apply(json.dumps)\n", + "df_ner_ds[\"tokens\"] = df_ner_ds[\"tokens\"].apply(json.dumps)\n", + "df_ner_ds.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44", + "metadata": {}, + "outputs": [], + "source": [ + "# save the processed dataset to a csv file for evaluation\n", + "df_ner_ds.to_csv(\"eval/ner_detection_eval.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "45", + "metadata": {}, + "source": [ + "### Run `mailcom` on the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46", + "metadata": {}, + "outputs": [], + "source": [ + "# load workflow configuration\n", + "new_settings = {\n", + " \"default_lang\": \"en\", # only English in the dataset\n", + " \"pseudo_fields\": [\"content\"], # we don't consider subject here\n", + " \"pseudo_emailaddresses\": False,\n", + " \"pseudo_ne\": True, # we want to pseudonymize
NE for this evaluation\n", + " \"pseudo_numbers\": False,\n", + " \"datetime_detection\": False,\n", + "}\n", + "\n", + "# save the updated configuration to a file for reproducibility purposes\n", + "new_settings_dir = eval_folder\n", + "workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings, \n", + " updated_setting_dir= new_settings_dir,\n", + " save_updated_settings=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47", + "metadata": {}, + "outputs": [], + "source": [ + "# load csv file into input handler\n", + "input_csv = \"eval/ner_detection_eval.csv\"\n", + "# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline\n", + "matching_columns = [\"sentence\"]\n", + "# the predefined keys that should be used to match these columns, in the correct order\n", + "pre_defined_keys = [\"content\"]\n", + "# what to call any columns that are not matched to pre-defined keys\n", + "# get this from the workflow settings\n", + "unmatched_keyword = workflow_settings.get(\"csv_col_unmatched_keyword\")\n", + "\n", + "input_handler = mailcom.get_input_handler(in_path=input_csv, in_type=\"csv\", \n", + " col_names=matching_columns, \n", + " init_data_fields=pre_defined_keys, \n", + " unmatched_keyword=unmatched_keyword)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48", + "metadata": {}, + "outputs": [], + "source": [ + "# process the input data\n", + "mailcom.process_data(input_handler.get_email_list(), workflow_settings)\n", + "# took around 2 minutes 35.2 seconds on SSC14 (desktop)\n", + "# on laptop, 32 GB RAM, no GPU, it took 16 minutes 48.3 seconds (other applications were running too)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49", + "metadata": {}, + "outputs": [], + "source": [ + "# convert the processed data into a dataframe\n", + "ner_df = pd.DataFrame(input_handler.get_email_list())\n", + "ner_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50", + "metadata": {}, + "outputs": [], + "source": [ + "# only keep the content, pseudo_content, and sentences columns\n", + "filtered_ner_df = ner_df[[\"content\", \"pseudo_content\", \"sentences\", \"ne_list\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51", + "metadata": {}, + "outputs": [], + "source": [ + "# add pseudo_content to the original dataset\n", + "org_ner_df = pd.read_csv(input_csv)\n", + "# deserialize JSON back to Python objects for entities and tokens columns\n", + "org_ner_df[\"entities\"] = org_ner_df[\"entities\"].apply(json.loads)\n", + "org_ner_df[\"tokens\"] = org_ner_df[\"tokens\"].apply(json.loads)\n", + "\n", + "# check before merging\n", + "(org_ner_df[\"sentence\"].sort_values().reset_index(drop=True) ==\n", + " filtered_ner_df[\"content\"].sort_values().reset_index(drop=True)).all()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52", + "metadata": {}, + "outputs": [], + "source": [ + "# merge the pseudo_content into the original dataset\n", + "merged_ner_df = org_ner_df.merge(filtered_ner_df, left_on=\"sentence\", right_on=\"content\", how=\"inner\", validate=\"one_to_one\")\n", + "\n", + "assert len(merged_ner_df) == len(org_ner_df) == len(filtered_ner_df), f\"Expected unchanged number of items after merging, but got {len(merged_ner_df)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53", + "metadata": {}, + "outputs": [], + "source": [ + "merged_ner_df = 
merged_ner_df.drop(columns=[\"content\"])\n", + "merged_ner_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn for each row\n", + "def calculate_tp_fp_fn(row):\n", + " gold_data = row[\"entities\"]\n", + " pred_data = row[\"ne_list\"].get(\"content\", [])\n", + "\n", + " # normalize data to get tuple of (type, text, start, end)\n", + " norm_gold_data = {\n", + " (e[\"type\"], e[\"text\"], e[\"start\"], e[\"end\"]) for e in gold_data\n", + " }\n", + " norm_pred_data = {\n", + " (e[\"entity_group\"], e[\"word\"], e[\"start\"], e[\"end\"]) for e in pred_data\n", + " }\n", + "\n", + " tp = len(norm_gold_data & norm_pred_data)\n", + " fp = len(norm_pred_data - norm_gold_data)\n", + " fn = len(norm_gold_data - norm_pred_data)\n", + "\n", + " return pd.Series({\"tp\": tp, \"fp\": fp, \"fn\": fn})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(dataframe):\n", + " # add tp, fp, fn columns to the dataframe\n", + " metrics_df = dataframe.apply(calculate_tp_fp_fn, axis=1)\n", + " result_df = pd.concat([dataframe, metrics_df], axis=1)\n", + "\n", + " # calculate precision, recall, and F1 score for all rows\n", + " tp = result_df[\"tp\"].sum()\n", + " fp = result_df[\"fp\"].sum()\n", + " fn = result_df[\"fn\"].sum()\n", + "\n", + " precision = tp / (tp + fp) if (tp + fp) > 0 else 0\n", + " recall = tp / (tp + fn) if (tp + fn) > 0 else 0\n", + " f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0\n", + "\n", + " print(f\"Precision: {precision:.4f}\")\n", + " print(f\"Recall: {recall:.4f}\")\n", + " print(f\"F1 Score: {f1_score:.4f}\")\n", + "\n", + " return result_df\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56", + "metadata": {}, + "outputs": [], + "source": [ + "result_df = cal_eval_metrics(merged_ner_df)\n", + "result_df.head()\n", + "# Precision: 0.7707\n", + "# Recall: 0.8401\n", + "# F1 Score: 0.8039" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57", + "metadata": {}, + "outputs": [], + "source": [ + "# save the results to a csv file for further analysis\n", + "result_df[\"entities\"] = result_df[\"entities\"].apply(json.dumps)\n", + "result_df[\"ne_list\"] = result_df[\"ne_list\"].apply(json.dumps)\n", + "result_df[\"tokens\"] = result_df[\"tokens\"].apply(json.dumps)\n", + "result_df.to_csv(\"eval/ner_detection_results.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58", + "metadata": {}, + "outputs": [], + "source": [ + "# originally, each content only has one sentence\n", + "# check if sentencizing works as expected\n", + "merged_ner_df[\"sentences\"].apply(lambda x: len(x.get(\"content\", []))).value_counts()\n", + "# sentences -- count\n", + "# 1 -- 11531\n", + "# 2 -- 60\n", + "# 3 -- 6" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59", + "metadata": {}, + "outputs": [], + "source": [ + "# check if non-single sentence cases affect the evaluation results\n", + "single_sentence_df = merged_ner_df[merged_ner_df[\"sentences\"].apply(lambda x: len(x.get(\"content\", [])) == 1)]\n", + "single_result_df = cal_eval_metrics(single_sentence_df)\n", + "single_result_df.head()\n", + "# Precision: 0.7743\n", + "# Recall: 0.8431\n", + "# F1 Score: 0.8072" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id":
"60", + "metadata": {}, + "outputs": [], + "source": [ + "# check double sentence cases\n", + "double_sentence_df = merged_ner_df[merged_ner_df[\"sentences\"].apply(lambda x: len(x.get(\"content\", [])) == 2)]\n", + "double_result_df = cal_eval_metrics(double_sentence_df)\n", + "double_result_df.head()\n", + "# Precision: 0.3150\n", + "# Recall: 0.3883\n", + "# F1 Score: 0.3478" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61", + "metadata": {}, + "outputs": [], + "source": [ + "# check triple sentence cases\n", + "triple_sentence_df = merged_ner_df[merged_ner_df[\"sentences\"].apply(lambda x: len(x.get(\"content\", [])) == 3)]\n", + "triple_result_df = cal_eval_metrics(triple_sentence_df)\n", + "triple_result_df.head()\n", + "# Precision: 0.0000\n", + "# Recall: 0.0000\n", + "# F1 Score: 0.0000" + ] + }, + { + "cell_type": "markdown", + "id": "62", + "metadata": {}, + "source": [ + "### Post analysis" + ] + }, + { + "cell_type": "markdown", + "id": "63", + "metadata": {}, + "source": [ + "Since Presidio does not consider MISC, recalculate the metrics by excluding MISC from both the ground-truth and the detected entities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64", + "metadata": {}, + "outputs": [], + "source": [ + "# load the results saved above\n", + "result_df = pd.read_csv(\"eval/ner_detection_results.csv\")\n", + "# deserialize JSON back to Python objects for entities and ne_list columns\n", + "result_df[\"entities\"] = result_df[\"entities\"].apply(json.loads)\n", + "result_df[\"ne_list\"] = result_df[\"ne_list\"].apply(json.loads)\n", + "result_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65", + "metadata": {}, + "outputs": [], + "source": [ + "considered_types = [\"PER\", \"ORG\", \"LOC\"] # we only consider these 3 types as Presidio does not consider MISC" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn for each row\n", + "def calculate_tp_fp_fn(row):\n", + " gold_data = row[\"entities\"]\n", + " pred_data = row[\"ne_list\"].get(\"content\", [])\n", + "\n", + " # normalize data to get tuple of (type, text, start, end)\n", + " norm_gold_data = {\n", + " (e[\"type\"], e[\"text\"], e[\"start\"], e[\"end\"]) for e in gold_data if e[\"type\"] in considered_types\n", + " }\n", + " norm_pred_data = {\n", + " (e[\"entity_group\"], e[\"word\"], e[\"start\"], e[\"end\"]) for e in pred_data if e[\"entity_group\"] in considered_types\n", + " }\n", + "\n", + " tp = len(norm_gold_data & norm_pred_data)\n", + " fp = len(norm_pred_data - norm_gold_data)\n", + " fn = len(norm_gold_data - norm_pred_data)\n", + "\n", + " return pd.Series({\"tp_filtered\": tp, \"fp_filtered\": fp, \"fn_filtered\": fn})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(dataframe):\n", + " # add tp, fp, fn columns to the dataframe\n", + " metrics_df = dataframe.apply(calculate_tp_fp_fn, axis=1)\n", + " result_df = pd.concat([dataframe, metrics_df], axis=1)\n", + "\n", + " # calculate precision, recall, and F1 score for all rows\n", + " tp = result_df[\"tp_filtered\"].sum()\n", + " fp = result_df[\"fp_filtered\"].sum()\n", + " fn = result_df[\"fn_filtered\"].sum()\n", + "\n", + " precision = tp / (tp + fp) if (tp + fp) > 0 else 0\n", + " recall = tp / (tp + fn) if (tp + fn) > 0 else 0\n", + " f1_score = 2 * precision * 
recall / (precision + recall) if (precision + recall) > 0 else 0\n", + "\n", + " print(f\"Precision - filtered: {precision:.4f}\")\n", + " print(f\"Recall - filtered: {recall:.4f}\")\n", + " print(f\"F1 Score - filtered: {f1_score:.4f}\")\n", + "\n", + " return result_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68", + "metadata": {}, + "outputs": [], + "source": [ + "post_result_df = cal_eval_metrics(result_df)\n", + "post_result_df.head()\n", + "# Precision - filtered: 0.8825\n", + "# Recall - filtered: 0.8760\n", + "# F1 Score - filtered: 0.8792" + ] + }, + { + "cell_type": "markdown", + "id": "69", + "metadata": {}, + "source": [ + "## Numerical data detection" + ] + }, + { + "cell_type": "markdown", + "id": "70", + "metadata": {}, + "source": [ + "### Prepare data" + ] + }, + { + "cell_type": "markdown", + "id": "71", + "metadata": {}, + "source": [ + "Since `mailcom` explicitly masks any digits in text, the dataset for this evaluation should fulfill the same requirement. The two datasets used above only annotate numbers in specific formats, such as license numbers and phone numbers, not all digits. \n", + "\n", + "We therefore used the ATIS dataset for this purpose. Each token in the dataset is annotated with a slot label, ensuring that all digits are included.\n", + "\n", + "The ATIS train dataset (4978 rows) was used for this evaluation instead of the test dataset (893 rows) to have a larger sample size. Since `mailcom`'s number detection does not use any machine learning model, using the train dataset would not cause data leakage issues.\n", + "\n", + "Thanks to [Yun-Nung (Vivian) Chen](https://github.com/yvchen/JointSLU) for publishing this dataset. Download the train dataset [here](https://github.com/yvchen/JointSLU/blob/master/data/atis.train.iob)." + ] + }, + { + "cell_type": "markdown", + "id": "72", + "metadata": {}, + "source": [ + "**Note**: *We evaluated `mailcom` on this dataset without comparing with other tools since the other tools do not explicitly mask all digits in text.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73", + "metadata": {}, + "outputs": [], + "source": [ + "# load dataset\n", + "def load_atis_dataset(file_path):\n", + " sentences = []\n", + " tokens = []\n", + " slots = []\n", + " with open(file_path, \"r\") as f:\n", + " for line in f:\n", + " # the line format is BOS tokens EOS slot_labels\n", + " line = line.strip().removeprefix(\"BOS\")\n", + " sentence, slot_labels = line.split(\"EOS\", maxsplit=1)\n", + " sentence = sentence.strip()\n", + " slot_labels = slot_labels.strip()\n", + "\n", + " sentences.append(sentence)\n", + " tokens.append(sentence.split())\n", + " slots.append(slot_labels.split()[1:-1]) # remove the leading and trailing O labels, which correspond to the BOS and EOS tokens\n", + " return pd.DataFrame({\"sentence\": sentences, \"tokens\": tokens, \"slot_labels\": slots})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74", + "metadata": {}, + "outputs": [], + "source": [ + "atis_df = load_atis_dataset(\"eval/atis.train.iob\")\n", + "atis_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75", + "metadata": {}, + "outputs": [], + "source": [ + "# find slot labels that contain digits\n", + "def has_digits(token: str):\n", + " return any(char.isdigit() for char in token)\n", + "\n", + "# filter rows containing digits in the dataset and also yield the unique slot labels that contain digits\n", +
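"# split digit tokens by their annotation: slot label \"O\" vs. a real slot label;\n", + "# the matching row indices are used to build the two evaluation subsets below\n", +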
"non_o_digit_slots = {}\n", + "o_digit_slots = {}\n", + "non_o_digit_row_indices = []\n", + "o_digit_row_indices = [] # row indices that contain digits marked as \"O\"\n", + "for idx, row in atis_df.iterrows():\n", + " tokens = row[\"tokens\"]\n", + " slot_labels = row[\"slot_labels\"]\n", + " for token, slot in zip(tokens, slot_labels):\n", + " if has_digits(token):\n", + " if slot == \"O\":\n", + " o_digit_slots[slot] = o_digit_slots.get(slot, 0) + 1\n", + " o_digit_row_indices.append(idx)\n", + " else:\n", + " non_o_digit_slots[slot] = non_o_digit_slots.get(slot, 0) + 1\n", + " non_o_digit_row_indices.append(idx)\n", + "non_o_digit_slots, o_digit_slots" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76", + "metadata": {}, + "outputs": [], + "source": [ + "non_o_digit_df = atis_df.loc[non_o_digit_row_indices].reset_index(drop=True)\n", + "non_o_digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77", + "metadata": {}, + "outputs": [], + "source": [ + "o_digit_df = atis_df.loc[o_digit_row_indices].reset_index(drop=True)\n", + "o_digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78", + "metadata": {}, + "outputs": [], + "source": [ + "# extract number-related tokens from sentences in the filtered datasets\n", + "def extract_numbers(tokens, slot_labels_rows, slot_labels_keys):\n", + " extracted_numbers = []\n", + " for token, slot in zip(tokens, slot_labels_rows):\n", + " if slot in slot_labels_keys and has_digits(token):\n", + " extracted_numbers.append(token)\n", + "\n", + " # sort by token length in descending order\n", + " # to make sure that number replacement work correctly\n", + " extracted_numbers.sort(key=lambda x: len(x), reverse=True)\n", + " return extracted_numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79", + "metadata": {}, + "outputs": [], + "source": [ + "# add column \"numbers\" to the two filtered dataframes\n", + "non_o_digit_df[\"numbers\"] = non_o_digit_df.apply(lambda row: extract_numbers(row[\"tokens\"], row[\"slot_labels\"], non_o_digit_slots.keys()), axis=1)\n", + "non_o_digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80", + "metadata": {}, + "outputs": [], + "source": [ + "o_digit_df[\"numbers\"] = o_digit_df.apply(lambda row: extract_numbers(row[\"tokens\"], row[\"slot_labels\"], o_digit_slots.keys()), axis=1)\n", + "o_digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81", + "metadata": {}, + "outputs": [], + "source": [ + "# combine to get the final dataset for evaluation\n", + "digit_df = pd.concat([non_o_digit_df, o_digit_df], ignore_index=True)\n", + "assert len(digit_df) == len(non_o_digit_df) + len(o_digit_df) == 789\n", + "digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82", + "metadata": {}, + "outputs": [], + "source": [ + "# convert string to json format\n", + "digit_df[\"tokens\"] = digit_df[\"tokens\"].apply(json.dumps)\n", + "digit_df[\"slot_labels\"] = digit_df[\"slot_labels\"].apply(json.dumps)\n", + "digit_df[\"numbers\"] = digit_df[\"numbers\"].apply(json.dumps)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83", + "metadata": {}, + "outputs": [], + "source": [ + "# filter duplicated sentences in the dataset\n", + "digit_df = digit_df.drop_duplicates(subset=[\"sentence\", \"tokens\", \"slot_labels\", \"numbers\"]).reset_index(drop=True)\n", + "assert len(digit_df) == 
665, f\"Expected 665 unique sentences, but got {len(digit_df)}\"\n", + "digit_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84", + "metadata": {}, + "outputs": [], + "source": [ + "# save the processed dataset to a csv file for evaluation\n", + "digit_df.to_csv(\"eval/number_detection_eval.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "85", + "metadata": {}, + "source": [ + "### Run `mailcom` on the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86", + "metadata": {}, + "outputs": [], + "source": [ + "# load workflow configuration\n", + "new_settings = {\n", + " \"default_lang\": \"en\", # only English in the dataset\n", + " \"pseudo_fields\": [\"content\"], # we don't consider subject here\n", + " \"pseudo_emailaddresses\": False,\n", + " \"pseudo_ne\": False,\n", + " \"pseudo_numbers\": True, # we want to pseudo numbers for this evaluation\n", + " \"datetime_detection\": False,\n", + "}\n", + "\n", + "# save the updated configuration to a file for reproducibility purposes\n", + "new_settings_dir = eval_folder\n", + "workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings, \n", + " updated_setting_dir= new_settings_dir,\n", + " save_updated_settings=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87", + "metadata": {}, + "outputs": [], + "source": [ + "# load csv file into input handler\n", + "input_csv = \"eval/number_detection_eval.csv\"\n", + "# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline\n", + "matching_columns = [\"sentence\"]\n", + "# the predefined keys that should be used to match these columns, in the correct order\n", + "pre_defined_keys = [\"content\"]\n", + "# what to call any columns that are not matched to pre-defined keys\n", + "# get this from the workflow settings\n", + "unmatched_keyword = workflow_settings.get(\"csv_col_unmatched_keyword\")\n", + "\n", + "input_handler = mailcom.get_input_handler(in_path=input_csv, in_type=\"csv\", \n", + " col_names=matching_columns, \n", + " init_data_fields=pre_defined_keys, \n", + " unmatched_keyword=unmatched_keyword)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88", + "metadata": {}, + "outputs": [], + "source": [ + "# process the input data\n", + "mailcom.process_data(input_handler.get_email_list(), workflow_settings)\n", + "# took around 1.9s on laptop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89", + "metadata": {}, + "outputs": [], + "source": [ + "# convert the processed data into a dataframe\n", + "number_df = pd.DataFrame(input_handler.get_email_list())\n", + "number_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90", + "metadata": {}, + "outputs": [], + "source": [ + "# only keep the content, pseudo_content, and sentences columns\n", + "filtered_number_df = number_df[[\"content\", \"pseudo_content\", \"sentences\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91", + "metadata": {}, + "outputs": [], + "source": [ + "# add pseudo_content to the original dataset\n", + "org_number_df = pd.read_csv(input_csv)\n", + "org_number_df[\"tokens\"] = org_number_df[\"tokens\"].apply(json.loads)\n", + "org_number_df[\"slot_labels\"] = org_number_df[\"slot_labels\"].apply(json.loads)\n", + "org_number_df[\"numbers\"] = org_number_df[\"numbers\"].apply(json.loads)\n", + "\n", + "# check before merging\n", + 
"(org_number_df[\"sentence\"].sort_values().reset_index(drop=True) ==\n", + " filtered_number_df[\"content\"].sort_values().reset_index(drop=True)).all()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92", + "metadata": {}, + "outputs": [], + "source": [ + "# merge the pseudo_content into the original datasetn_df[\n", + "merged_number_df = org_number_df.merge(filtered_number_df, left_on=\"sentence\", right_on=\"content\", how=\"inner\", validate=\"one_to_one\")\n", + "\n", + "assert len(merged_number_df) == len(org_number_df) == len(filtered_number_df), f\"Expected unchanged number of items after merging, but got {len(merged_number_df)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93", + "metadata": {}, + "outputs": [], + "source": [ + "merged_number_df = merged_number_df.drop(columns=[\"content\"])\n", + "merged_number_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94", + "metadata": {}, + "outputs": [], + "source": [ + "# create expected content column as ground truth for evaluation\n", + "def number_to_placeholder(word):\n", + " characters = list(word)\n", + " new_characters = []\n", + " for i, char in enumerate(characters):\n", + " if char.isdigit():\n", + " if i == 0 or not characters[i-1].isdigit():\n", + " new_characters.append(\"[number]\")\n", + " else:\n", + " new_characters.append(char)\n", + " return \"\".join(new_characters)\n", + "\n", + "def replace_number_with_placeholder(sentence, numbers_list):\n", + " replaced_sentence = sentence\n", + " for number in numbers_list:\n", + " replaced_sentence = replaced_sentence.replace(number, number_to_placeholder(number))\n", + " return replaced_sentence\n", + "\n", + "merged_number_df[\"expected_content\"] = merged_number_df.apply(lambda row: replace_number_with_placeholder(row[\"sentence\"], row[\"numbers\"]), axis=1)\n", + "merged_number_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate the exact match accuracy between the pseudo_content and the expected_content\n", + "merged_number_df[\"exact_match\"] = merged_number_df.apply(lambda row: row[\"pseudo_content\"] == row[\"expected_content\"], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "96", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn\n", + "def calculate_tp_fp_fn(row):\n", + " pred_count = len(re.findall(r\"\\[number\\]\", row[\"pseudo_content\"]))\n", + " gold_count = len(re.findall(r\"\\[number\\]\", row[\"expected_content\"]))\n", + "\n", + " true_positives = min(pred_count, gold_count)\n", + " false_positives = max(pred_count - gold_count, 0)\n", + " false_negatives = max(gold_count - pred_count, 0)\n", + " \n", + " return pd.Series({\"tp\": true_positives, \"fp\": false_positives, \"fn\": false_negatives})\n", + "\n", + "metrics_df = merged_number_df.apply(calculate_tp_fp_fn, axis=1)\n", + "merged_number_df = pd.concat([merged_number_df, metrics_df], axis=1)\n", + "merged_number_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97", + "metadata": {}, + "outputs": [], + "source": [ + "# save the results to a csv file for further analysis\n", + "merged_number_df.to_csv(\"eval/number_detection_results.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(df):\n", + " total_tp = 
df[\"tp\"].sum()\n", + " total_fp = df[\"fp\"].sum()\n", + " total_fn = df[\"fn\"].sum()\n", + "\n", + " precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0\n", + " recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0\n", + " f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0\n", + " \n", + " accuracy = df[\"exact_match\"].mean()\n", + " print(f\"Exact match accuracy: {accuracy:.4f}\")\n", + " print(f\"Micro Precision: {precision:.4f}\")\n", + " print(f\"Micro Recall: {recall:.4f}\")\n", + " print(f\"Micro F1 Score: {f1:.4f}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99", + "metadata": {}, + "outputs": [], + "source": [ + "cal_eval_metrics(merged_number_df)\n", + "# Exact match accuracy: 1.0000\n", + "# Micro Precision: 1.0000\n", + "# Micro Recall: 1.0000\n", + "# Micro F1 Score: 1.0000" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mailcom", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/notebooks/test_other_tools.ipynb b/docs/source/notebooks/test_other_tools.ipynb new file mode 100644 index 0000000..f4be579 --- /dev/null +++ b/docs/source/notebooks/test_other_tools.ipynb @@ -0,0 +1,709 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# This notebook is used to test other pseudonymization tools\n", + "\n", + "Tools to compare with `mailcom`:\n", + "- [Presidio](https://github.com/microsoft/presidio/)\n", + "- [Scrubadub](https://github.com/LeapBeyond/scrubadub)" + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "## Install tools if needed\n", + "\n", + "Python 3.10" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install scrubadub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install scrubadub_spacy scrubadub_stanford" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install \"presidio_analyzer[transformers]\"\n", + "%pip install presidio_anonymizer\n", + "# python -m spacy download en_core_web_sm" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "## Try out the tools" + ] + }, + { + "cell_type": "markdown", + "id": "6", + "metadata": {}, + "source": [ + "### Presidio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer import AnalyzerEngine\n", + "from presidio_analyzer.nlp_engine import TransformersNlpEngine\n", + "from presidio_anonymizer import AnonymizerEngine\n", + "from presidio_analyzer.nlp_engine import NlpEngineProvider" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8", + "metadata": {}, + "outputs": [], + "source": [ + "text = \"\"\"\n", + "Dear Dr. Emma Müller,\n", + "\n", + "I hope this email finds you well.\n", + "\n", + "I'm writing to you today from NextAI AG in Munich, Germany. 
We're keen to discuss the exciting developments from the recent Heidelberg AI Summit 2025. Specifically, we were very interested in the presentation on \"Next-Generation Robotics\" that took place in Room 306 of the main convention center.\n", + "\n", + "Our team at NextAI AG would love to set up a quick call to discuss potential collaborations following the insights shared. We're thinking of a brief chat around July 15th.\n", + "\n", + "Please let us know if July 15th works for you, or suggest an alternative time.\n", + "\n", + "Best regards,\n", + "Max Schneider\n", + "\n", + "--- Forwarded message ---\n", + "From: info@heidelbergaisummit.de\n", + "Date: Friday, 27 June 2025 at 14:30:06\n", + "Subject: Heidelberg AI Summit 2025 - Thank You!\n", + "\n", + "Dear Attendees,\n", + "\n", + "Thank you for making the Heidelberg AI Summit 2025 a resounding success! We truly appreciate your participation and engagement. We look forward to seeing you at future events.\n", + "\n", + "Sincerely, \n", + "The Heidelberg AI Summit Team\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9", + "metadata": {}, + "outputs": [], + "source": [ + "# run with custom model config file\n", + "\n", + "config_content = \"\"\"\n", + "nlp_engine_name: transformers\n", + "models:\n", + " - lang_code: en\n", + " model_name:\n", + " spacy: en_core_web_sm\n", + " transformers: xlm-roberta-large-finetuned-conll03-english\n", + " model_kwargs:\n", + " revision: \"18f95e9\"\n", + "ner_model_configuration:\n", + " labels_to_ignore:\n", + " - O\n", + " model_to_presidio_entity_mapping:\n", + " PER: PERSON\n", + " LOC: LOCATION\n", + " ORG: ORGANIZATION\n", + " MISC: MISC\n", + " low_confidence_score_multiplier: 0.4\n", + " low_score_entity_names: []\n", + "\"\"\"\n", + "\n", + "# save config to a yaml file\n", + "with open(\"config.yaml\", \"w\") as f:\n", + " f.write(config_content)\n", + "\n", + "# Create NLP engine based on configuration file\n", + "provider = NlpEngineProvider(conf_file=\"config.yaml\")\n", + "nlp_engine = provider.create_engine()\n", + "\n", + "# Set up the engine, loads the NLP module (spaCy model by default) \n", + "# and other PII recognizers\n", + "analyzer = AnalyzerEngine(nlp_engine=nlp_engine)\n", + "\n", + "# Call analyzer to get results\n", + "results = analyzer.analyze(text=text, language='en')\n", + "print(results)\n", + "\n", + "# Analyzer results are passed to the AnonymizerEngine for anonymization\n", + "\n", + "anonymizer = AnonymizerEngine()\n", + "\n", + "anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)\n", + "\n", + "print(anonymized_text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10", + "metadata": {}, + "outputs": [], + "source": [ + "# run with transformers model name\n", + "\n", + "# Define which transformers model to use\n", + "model_config = [{\"lang_code\": \"en\", \"model_name\": {\n", + " \"spacy\": \"en_core_web_sm\", # use a small spaCy model for lemmas, tokens etc.\n", + " \"transformers\": \"xlm-roberta-large-finetuned-conll03-english\"\n", + " },\n", + " \"model_kwargs\": {\n", + " \"revision\": \"18f95e9\",\n", + " }\n", + "}]\n", + "\n", + "nlp_engine = TransformersNlpEngine(models=model_config)\n", + "\n", + "# Set up the engine, loads the NLP module (spaCy model by default) \n", + "# and other PII recognizers\n", + "analyzer = AnalyzerEngine(nlp_engine=nlp_engine)\n", + "\n", + "# Call analyzer to get results\n", + "results = analyzer.analyze(text=text, language='en')\n",
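+ "# each result is a RecognizerResult carrying entity_type, start/end character offsets, and a score\n",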
+ "print(results)\n", + "\n", + "# Analyzer results are passed to the AnonymizerEngine for anonymization\n", + "\n", + "anonymizer = AnonymizerEngine()\n", + "\n", + "anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)\n", + "\n", + "print(anonymized_text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11", + "metadata": {}, + "outputs": [], + "source": [ + "for item in results:\n", + " print(f\"Entity: {item.entity_type}, Start: {item.start}, End: {item.end}, Score: {item.score}\")" + ] + }, + { + "cell_type": "markdown", + "id": "12", + "metadata": {}, + "source": [ + "### Scrubadub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13", + "metadata": {}, + "outputs": [], + "source": [ + "import scrubadub, scrubadub_spacy, scrubadub_stanford # scrubadub_address requires additional setup, see https://scrubadub.readthedocs.io/en/stable/addresses.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14", + "metadata": {}, + "outputs": [], + "source": [ + "text = \"\"\"\n", + "Dear Dr. Emma Müller,\n", + "\n", + "I hope this email finds you well.\n", + "\n", + "I'm writing to you today from NextAI AG in Munich, Germany. We're keen to discuss the exciting developments from the recent Heidelberg AI Summit 2025. Specifically, we were very interested in the presentation on \"Next-Generation Robotics\" that took place in Room 306 of the main convention center.\n", + "\n", + "Our team at NextAI AG would love to set up a quick call to discuss potential collaborations following the insights shared. We're thinking of a brief chat around July 15th.\n", + "\n", + "Please let us know if July 15th works for you, or suggest an alternative time.\n", + "\n", + "Best regards,\n", + "Max Schneider\n", + "\n", + "--- Forwarded message ---\n", + "From: info@heidelbergaisummit.de\n", + "Date: Friday, 27 June 2025 at 14:30:06\n", + "Subject: Heidelberg AI Summit 2025 - Thank You!\n", + "\n", + "Dear Attendees,\n", + "\n", + "Thank you for making the Heidelberg AI Summit 2025 a resounding success! We truly appreciate your participation and engagement. 
We look forward to seeing you at future events.\n", + "\n", + "Sincerely, \n", + "The Heidelberg AI Summit Team\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15", + "metadata": {}, + "outputs": [], + "source": [ + "# add external detectors\n", + "scrubber = scrubadub.Scrubber()\n", + "\n", + "# only use one at a time, otherwise the pseudonymized text will have multiple tags for the same entity\n", + "scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)\n", + "# scrubber.add_detector(scrubadub_spacy.detectors.SpacyNameDetector)\n", + "\n", + "# adding the detector below caused the process to run indefinitely; the cause is unclear\n", + "# scrubber.add_detector(scrubadub_stanford.detectors.StanfordEntityDetector)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16", + "metadata": {}, + "outputs": [], + "source": [ + "pseudonymized_text = scrubber.clean(text)\n", + "print(pseudonymized_text)" + ] + }, + { + "cell_type": "markdown", + "id": "17", + "metadata": {}, + "source": [ + "## Run Presidio on Hugging Face datasets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18", + "metadata": {}, + "outputs": [], + "source": [ + "# import if needed\n", + "from presidio_analyzer import AnalyzerEngine\n", + "from presidio_analyzer.nlp_engine import TransformersNlpEngine\n", + "\n", + "import pandas as pd\n", + "import json\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19", + "metadata": {}, + "outputs": [], + "source": [ + "# run with transformers model name\n", + "\n", + "# Define which transformers model to use\n", + "model_config = [{\"lang_code\": \"en\", \"model_name\": {\n", + " \"spacy\": \"en_core_web_md\", # use a medium spaCy model to match mailcom's setting\n", + " \"transformers\": \"xlm-roberta-large-finetuned-conll03-english\"\n", + " },\n", + " \"model_kwargs\": {\n", + " \"revision\": \"18f95e9\",\n", + " }\n", + "}]\n", + "\n", + "nlp_engine = TransformersNlpEngine(models=model_config)\n", + "\n", + "# Set up the engine, loads the NLP module (spaCy model by default) \n", + "# and other PII recognizers\n", + "analyzer = AnalyzerEngine(nlp_engine=nlp_engine)\n" + ] + }, + { + "cell_type": "markdown", + "id": "20", + "metadata": {}, + "source": [ + "### Email dataset\n", + "\n", + "We used the same dataset as in `quantitative_eval.ipynb` to compare the results of `mailcom` with Presidio.\n", + "\n", + "Run the Prepare data section in `quantitative_eval.ipynb` to get the `\"eval/email_detection_eval.csv\"` before running this section."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21", + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset\n", + "email_df = pd.read_csv(\"eval/email_detection_eval.csv\")\n", + "# deserialize JSON back to Python objects for emails column\n", + "email_df[\"emails\"] = email_df[\"emails\"].apply(json.loads)\n", + "email_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22", + "metadata": {}, + "outputs": [], + "source": [ + "# Call analyzer to get results\n", + "def analyze_row(row):\n", + " text = row[\"text\"]\n", + " results = analyzer.analyze(text=text, language=\"en\")\n", + " saved_results = []\n", + " for item in results:\n", + " if item.entity_type == \"EMAIL_ADDRESS\":\n", + " saved_results.append(text[item.start:item.end])\n", + "\n", + " return saved_results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23", + "metadata": {}, + "outputs": [], + "source": [ + "# add a new column to the dataframe with the analyzer results\n", + "email_df[\"presidio_emails\"] = email_df.apply(analyze_row, axis=1) # 28m 44.5s on laptop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24", + "metadata": {}, + "outputs": [], + "source": [ + "# create expected text and actual text after replacing email by [email]\n", + "def replace_emails_with_placeholder(text, emails):\n", + " for email in emails:\n", + " text = text.replace(email, \"[email]\")\n", + " return text\n", + "\n", + "email_df[\"expected_text\"] = email_df.apply(lambda row: replace_emails_with_placeholder(row[\"text\"], row[\"emails\"]), axis=1)\n", + "email_df[\"presidio_text\"] = email_df.apply(lambda row: replace_emails_with_placeholder(row[\"text\"], row[\"presidio_emails\"]), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25", + "metadata": {}, + "outputs": [], + "source": [ + "email_df[\"exact_match\"] = email_df.apply(lambda row: row[\"expected_text\"] == row[\"presidio_text\"], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn\n", + "def calculate_tp_fp_fn(row):\n", + " pred_count = len(re.findall(r\"\\[email\\]\", row[\"presidio_text\"]))\n", + " gold_count = len(re.findall(r\"\\[email\\]\", row[\"expected_text\"]))\n", + "\n", + " true_positives = min(pred_count, gold_count)\n", + " false_positives = max(pred_count - gold_count, 0)\n", + " false_negatives = max(gold_count - pred_count, 0)\n", + " \n", + " return pd.Series({\"tp\": true_positives, \"fp\": false_positives, \"fn\": false_negatives})\n", + "\n", + "metrics_df = email_df.apply(calculate_tp_fp_fn, axis=1)\n", + "email_df = pd.concat([email_df, metrics_df], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27", + "metadata": {}, + "outputs": [], + "source": [ + "# save the results to a csv file for further analysis\n", + "email_df.to_csv(\"eval/presidio_email_detection_results_revision.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(df):\n", + " total_tp = df[\"tp\"].sum()\n", + " total_fp = df[\"fp\"].sum()\n", + " total_fn = df[\"fn\"].sum()\n", + "\n", + " precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0\n", + " recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0\n", + " f1 = 2 * precision * 
recall / (precision + recall) if (precision + recall) > 0 else 0\n", + " \n", + " accuracy = df[\"exact_match\"].mean()\n", + " print(f\"Exact match accuracy: {accuracy:.4f}\")\n", + " print(f\"Micro Precision: {precision:.4f}\")\n", + " print(f\"Micro Recall: {recall:.4f}\")\n", + " print(f\"Micro F1 Score: {f1:.4f}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29", + "metadata": {}, + "outputs": [], + "source": [ + "cal_eval_metrics(email_df)\n", + "# Exact match accuracy: 0.9995\n", + "# Micro Precision: 1.0000\n", + "# Micro Recall: 0.9995\n", + "# Micro F1 Score: 0.9998\n", + "# There is only one failed case:\n", + "# -- sentence: With an email address that reflects her professional prowess-bhamini.gulati@shukla.biz-she navigates the digital realm with ease.\n", + "# -- expected content: With an email address that reflects her professional prowess-[email]-she navigates the digital realm with ease.\n", + "# -- Presidio did not recognize the email." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30", + "metadata": {}, + "outputs": [], + "source": [ + "email_df[email_df[\"exact_match\"] == False]" + ] + }, + { + "cell_type": "markdown", + "id": "31", + "metadata": {}, + "source": [ + "### NER dataset\n", + "\n", + "We used the same dataset as in `quantitative_eval.ipynb` to compare the results of `mailcom` with Presidio.\n", + "\n", + "Run the Prepare data section in `quantitative_eval.ipynb` to get the `\"eval/ner_detection_eval.csv\"` before running this section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32", + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset\n", + "ner_df = pd.read_csv(\"eval/ner_detection_eval.csv\")\n", + "# deserialize JSON back to Python objects for entities and tokens columns\n", + "ner_df[\"entities\"] = ner_df[\"entities\"].apply(json.loads)\n", + "ner_df[\"tokens\"] = ner_df[\"tokens\"].apply(json.loads)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33", + "metadata": {}, + "outputs": [], + "source": [ + "type_mapping = {\n", + " \"PERSON\": \"PER\",\n", + " \"LOCATION\": \"LOC\",\n", + " \"ORGANIZATION\": \"ORG\",\n", + "} # Presidio does not consider MISC type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "34", + "metadata": {}, + "outputs": [], + "source": [ + "# Call analyzer to get results\n", + "def analyze_row(row):\n", + " text = row[\"sentence\"]\n", + " results = analyzer.analyze(text=text, language=\"en\")\n", + " saved_results = []\n", + " for item in results:\n", + " if item.entity_type in type_mapping:\n", + " entity_type = type_mapping[item.entity_type]\n", + " else:\n", + " entity_type = item.entity_type\n", + " saved_results.append({\n", + " \"type\": entity_type,\n", + " \"start\": item.start,\n", + " \"end\": item.end,\n", + " \"text\": text[item.start:item.end],\n", + " })\n", + "\n", + " return saved_results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35", + "metadata": {}, + "outputs": [], + "source": [ + "# add a new column to the dataframe with the analyzer results\n", + "ner_df[\"presidio_results\"] = ner_df.apply(analyze_row, axis=1) # 19 minutes 4.2 seconds on laptop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36", + "metadata": {}, + "outputs": [], + "source": [ + "# calculate tp, fp, fn for each row\n", + "def calculate_tp_fp_fn(row):\n", + " gold_data = row[\"entities\"]\n", + " pred_data = 
row[\"presidio_results\"]\n", + "\n", + " # normalize data to get tuple of (type, text, start, end)\n", + " norm_gold_data = {\n", + " (e[\"type\"], e[\"text\"], e[\"start\"], e[\"end\"]) for e in gold_data if e[\"type\"] in type_mapping.values()\n", + " }\n", + " norm_pred_data = {\n", + " (e[\"type\"], e[\"text\"], e[\"start\"], e[\"end\"]) for e in pred_data if e[\"type\"] in type_mapping.values()\n", + " }\n", + "\n", + " tp = len(norm_gold_data & norm_pred_data)\n", + " fp = len(norm_pred_data - norm_gold_data)\n", + " fn = len(norm_gold_data - norm_pred_data)\n", + "\n", + " return pd.Series({\"tp\": tp, \"fp\": fp, \"fn\": fn})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37", + "metadata": {}, + "outputs": [], + "source": [ + "def cal_eval_metrics(dataframe):\n", + " # calculate precision, recall, and F1 score for all rows\n", + " tp = fp = fn = 0\n", + "\n", + " # add tp, fp, fn columns to the dataframe\n", + " metrics_df = dataframe.apply(calculate_tp_fp_fn, axis=1)\n", + " result_df = pd.concat([dataframe, metrics_df], axis=1)\n", + "\n", + " for _, row in result_df.iterrows():\n", + " tp += row[\"tp\"]\n", + " fp += row[\"fp\"]\n", + " fn += row[\"fn\"]\n", + "\n", + " precision = tp / (tp + fp) if (tp + fp) > 0 else 0\n", + " recall = tp / (tp + fn) if (tp + fn) > 0 else 0\n", + " f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0\n", + "\n", + " print(f\"Precision: {precision:.4f}\")\n", + " print(f\"Recall: {recall:.4f}\")\n", + " print(f\"F1 Score: {f1_score:.4f}\")\n", + "\n", + " return result_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38", + "metadata": {}, + "outputs": [], + "source": [ + "result_df = cal_eval_metrics(ner_df)\n", + "result_df.head()\n", + "# Precision: 0.6230\n", + "# Recall: 0.4985\n", + "# F1 Score: 0.5538" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39", + "metadata": {}, + "outputs": [], + "source": [ + "# save the results to a csv file for further analysis\n", + "result_df.to_csv(\"eval/presidio_ner_detection_results_revision.csv\", index=False)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "test-mailcom", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.20" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/evaluation_strategy.md b/evaluation_strategy.md new file mode 100644 index 0000000..6e2eea4 --- /dev/null +++ b/evaluation_strategy.md @@ -0,0 +1,165 @@ +# Evaluation Strategy for `mailcom` + +Given the limited availability of free email and text redaction corpora with ground-truth annotations for email addresses, named entities, and numbers, we assessed `mailcom` using both qualitative and quantitative methods for each transformation step. The evaluation strategy is outlined as follows: + +## Qualitative Evaluation + +For the qualitative evaluation, we focused on the utility and performance of `mailcom` in identifying and pseudonymizing sensitive information on sample data. 
We also compared `mailcom`'s performance against other open-source pseudonymization tools, including [QualiAnon](https://github.com/pangaea-data-publisher/qualianon), [Amnesia](https://github.com/dTsitsigkos/Amnesia), [Presidio](https://github.com/microsoft/presidio/), and [Scrubadub](https://github.com/LeapBeyond/scrubadub). + +### `mailcom` alone + +Our sample data was provided by the research group of Sybille Große and the [email donors](https://mailcom.rose.uni-heidelberg.de/), including a set of four short emails in `eml` format and a `csv` file of 103 email contents. We applied `mailcom` to this sample data and manually reviewed the outputs to assess the effectiveness of the pseudonymization process. The results are summarized as follows: +- **Accuracy**: `mailcom` correctly identified and pseudonymized email addresses, numbers, and most named entities, including person names, organization names, and location names. However, there are some misaligned NER cases, which are detailed below. +- **Running Time**: with default settings, we ran `mailcom` on an Intel Core Ultra 7 laptop with 32GB RAM (no GPU) and observed the following running times: + - For the 4 `eml` files: around 6.7 seconds + - For the 103 email contents in the `csv` file: around 5 minutes 59.5 seconds, i.e. ~3.49 seconds/row + +#### Some misaligned cases + +In the French and Spanish samples among the four short `eml` files, we observed misalignments in the NER results: + +* *Location not fully detected*: In the French sample sentence `"Adresse : 123, rue Principale, 12345 Ville Modèle"`, the default NER model did not recognize `"Principale"` as part of a location entity. +* *Incorrect MISC tagging*: In another French sample sentence, `"April 2024 um 16:53:37 MESZ"`, the substring `"ESZ"` within `"MESZ"` was incorrectly labeled as MISC. We attribute this to a language mismatch: the email content is in French, while this particular sentence is in German. +* *Sentence segmentation issue*: In both the Spanish and French samples, the first sentences were not split as expected. For example, the Spanish text `"El mié., 17 abr. 2024 17:20:18 +0200, Alejandro Rodriguez escribió"` should be treated as a single sentence, but it was incorrectly split into three segments: `"El mié.,"`, `"17 abr."`, and `"2024 17:20:18 +0200, Alejandro Rodriguez escribió"` (a minimal sketch reproducing this behavior is given after the QualiAnon discussion below). + * One reason for this is incorrect sentence splitting by the spaCy models. + * Another reason is the language mismatch between the spaCy model used and the sentences considered, since some emails contain multiple languages. For instance, the French email sample and some French email contents in the `csv` file start with a German sentence. + +### `mailcom` vs. other open-source pseudonymization tools + +We used the four short sample emails to test similar pseudonymization tools. + +#### QualiAnon + +Running `QualiAnon` requires the Amazon Corretto 17 JDK (version 17.0.8.8.1). However, according to their [video tutorial](https://www.youtube.com/watch?v=RYQn4DjdKmo), QualiAnon is not fully automated: the user first manually defines the type and replacement for sensitive tokens, and the program then substitutes them. In addition, QualiAnon currently supports only the docx format and is not designed for large projects (i.e. more than 100 transcripts). It is also unclear whether QualiAnon offers general support for different languages. Therefore, we did not test QualiAnon further on our sample data.
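+
+As referenced in the sentence segmentation issue above, the following minimal sketch reproduces the kind of over-splitting we observed. The choice of the spaCy Spanish model `es_core_news_md` here is an assumption for illustration; `mailcom`'s actual model selection may differ.
+
+```python
+# Minimal sketch: a spaCy pipeline may treat abbreviations with trailing dots
+# (e.g. "mié.", "abr.") as sentence boundaries and over-split the text.
+# Assumes es_core_news_md is installed: python -m spacy download es_core_news_md
+import spacy
+
+nlp = spacy.load("es_core_news_md")
+doc = nlp("El mié., 17 abr. 2024 17:20:18 +0200, Alejandro Rodriguez escribió")
+for sent in doc.sents:
+    print(repr(sent.text))  # prints one line per detected sentence segment
+```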
+ +#### Amnesia + +Amnesia also requires a Java setup to run locally, so we instead tried the [Amnesia playground website](https://amnesia.openaire.eu/demo/mywizard.html) on our sample emails. + +Unfortunately, we were unable to use Amnesia for our purpose due to formatting constraints. Specifically, Amnesia requires the input data to be provided as a table with a specific delimiter (e.g. `,`). The tool supports four possible [input table formats](https://www.youtube.com/watch?v=vZbU0n6n01c): +* Simple table: Columns may contain different data types, with each cell holding a single value, e.g. a name or a number. +* Table with a set-valued attribute: A fixed number of columns where each column contains a set of values. A delimiter for the set must be specified, e.g. `|`. +* Set of values: An arbitrary number of values of the same type. +* Disk-based simple table: Intended for very large datasets, as described in their tutorial video. + +None of these options were suitable for our sample data structure, where each email content is a long text containing multiple sentences and is stored in a single cell of a `csv` file. + +#### Presidio + +We first tried to anonymize the English sample email on the [Presidio demo webpage](https://huggingface.co/spaces/presidio/presidio_demo). None of the provided models covered all named entities and email addresses. For instance: +* Model `spaCy/en_core_web_lg` overlooked the organization and event entities, which should be tagged as `ORG` and `MISC`, respectively. +* Model `flair/ner-english-large` missed the event entity. +* Model `HuggingFace/obi/deid_roberta_i2b2` did not recognize the second person entity and misclassified the event entity as an organization. +* Model `HuggingFace/StanfordAIMI/stanford-deidentifier-base` mislabeled an ordinary phrase as an organization and misclassified the event entity as an organization. +* Model `stanza/en` ignored the organization and event entities. + +It seems that Presidio by default does not treat arbitrary numbers as sensitive information, which explains why some of the models mentioned above detected only numbers in date formats. We therefore discarded the number comparison in our summary above. + +We also installed Presidio to use its anonymizer with the same Transformer model that we used for NER in our work, i.e. `xlm-roberta-large-finetuned-conll03-english` (see the `docs/source/notebooks/test_other_tools.ipynb` notebook). However, the pseudonymized text still did not cover all the expected entities, leaving the second person name and the event name (MISC) unmasked. + +It is worth noting that under-pseudonymization, especially of person names, can lead to significant privacy risks, while over-pseudonymization can reduce the utility of the data. + +#### Scrubadub + +By default, Scrubadub only detects email addresses and phone numbers. According to their [documentation](https://scrubadub.readthedocs.io/en/stable/usage.html#adding-an-optional-or-external-detector), users can add custom detectors for address and entity detection. However: +* Adding the address detector failed due to installation issues (we used Python 3.10). +* The location and event entities were ignored by both the spaCy entity and name detectors. + +## Quantitative Evaluation + +Since no single available dataset provides ground-truth annotations for email addresses, named entities, and numbers, we evaluated each transformation step of `mailcom` (email address detection, NER, and number detection) separately on a relevant benchmark dataset.
The evaluation scripts are available in the `docs/source/notebooks/quantitative_eval.ipynb` notebook. + +Additionally, we compared the results of `mailcom` with `Presidio` for email address detection and NER only, since Presidio and other open-source pseudonymization tools do not explicitly mask all numbers in a text. + +In summary, we observed that `mailcom` yielded results comparable to `Presidio` for email address detection and outperformed `Presidio` in NER while using the same Transformer model. For number detection, `mailcom` achieved perfect accuracy, precision, recall, and F1 scores. Below are the detailed evaluation results for each transformation step. + +### Email address detection + +#### Dataset + +We used the [Hugging Face `Josephgflowers/PII-NER`](https://huggingface.co/datasets/Josephgflowers/PII-NER) dataset for this evaluation. This dataset is designed for training and evaluating NER models for PII detection. Each record contains a prompt instructing a model to extract PII, a multi-paragraph text, and the extracted PII entities, such as student name, email address, phone number, driving license, etc. We focused on the email address detection part for our evaluation, hence using only the text and the extracted email address entities. + +#### Evaluation results + +Since we used regular expressions for email address detection, we evaluated the performance of `mailcom` using: + +* accuracy: exact match after replacing email addresses by placeholders, +* precision: the proportion of detected email addresses that are correct, +* recall: the proportion of ground-truth email addresses that are detected, +* F1 score: the harmonic mean of precision and recall. + +The evaluation results are as follows: + +| Metric | `mailcom` | `Presidio` | +|-----------|---------|----------| +| Accuracy | 99.95% | 99.95% | +| Precision | 1.0 | 1.0 | +| Recall | 1.0 | 0.9995 | +| F1 Score | 1.0 | 0.9998 | + +### Named entity detection + +#### Dataset + +For this transformation step, we evaluated on the [Hugging Face `Babelscape/wikineural`](https://huggingface.co/datasets/Babelscape/wikineural) dataset. This dataset contains token-level annotations for named entities, including person names, organization names, location names, and miscellaneous entities, specifically: + +``` +{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8} +``` + +We first extracted the named entities and their types from the dataset, then evaluated the performance of `mailcom` using precision, recall, and F1 score across entity types. + +#### Evaluation results + +When considering all entity types, `mailcom` achieved the following results: + +| Metric | `mailcom` | `mailcom` on single-sentence input | +|-----------|---------|---------| +| Precision | 0.7707 | 0.7743 | +| Recall | 0.8401 | 0.8431 | +| F1 Score | 0.8039 | 0.8072 | + +Here, single-sentence input means that we applied `mailcom` only to cases where the `spaCy` sentencizer correctly retains the original sentence, which occurs in around 99.4% of the cases in the dataset. + +For the remaining cases, where the original sentence is split into two or three sentences, the results are lower, as detailed in the notebook. This is due to the misalignment between the original sentence and the sentences after splitting, which leads to incorrect index matching for the detected entities. +* A detected entity is considered correct only if its type, text, and start and end indices all match the gold entity.
Therefore, even if the detected entity type and text are correct, the misalignment in sentence splitting can cause the start and end indices to be wrong, resulting in a false negative. +* Typical cases where the spaCy sentencizer mis-splits a sentence involve internal dots, such as `No. 1`, `aff.`, and `sp.`. + +To compare the results with `Presidio` while using the same Transformer model for NER, i.e. `xlm-roberta-large-finetuned-conll03-english`, we considered only the entity types detected by both `mailcom` and `Presidio`, namely `PER`, `ORG`, and `LOC`. + +| Metric | `mailcom` | `Presidio` | +|-----------|---------|----------| +| Precision | 0.8825 | 0.6230 | +| Recall | 0.8760 | 0.4985 | +| F1 Score | 0.8792 | 0.5538 | + +Here, the results of `Presidio` are substantially lower than those of `mailcom`. We did not inspect `Presidio`'s source code, but we suspect that this discrepancy is due to `Presidio`'s default configuration and the Transformer model used. In [one of the evaluations](https://github.com/microsoft/presidio-research/blob/master/notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb) published by `Presidio` on their synthetic datasets, using the `StanfordAIMI/stanford-deidentifier-base` model, they achieved around 0.87 precision and 0.84 recall. Therefore, comparing `mailcom` and `Presidio` using other Transformer models for NER, such as `StanfordAIMI/stanford-deidentifier-base`, is a possible future direction to further investigate the performance of both tools. + +### Number detection + +#### Dataset + +Since `mailcom` explicitly masks all digits in a text, the dataset for this evaluation must provide ground-truth annotations covering all digits. The two datasets used above annotate numbers only in specific formats, such as license numbers, phone numbers, etc., not all digits. + +We therefore used the ATIS dataset for this purpose. Each token in the dataset is annotated with a slot label, ensuring that all digits are covered. + +The ATIS train set (4978 rows) was used for this evaluation instead of the test set (893 rows) to obtain a larger sample size. Since `mailcom`'s number detection does not use any machine learning model, using the train set does not cause data leakage issues. + +Thanks to [Yun-Nung (Vivian) Chen](https://github.com/yvchen/JointSLU) for publishing this dataset. The train set can be downloaded [here](https://github.com/yvchen/JointSLU/blob/master/data/atis.train.iob). + +#### Evaluation results + +| Metric | `mailcom` | +|-----------|---------| +| Accuracy | 100% | +| Precision | 1.0 | +| Recall | 1.0 | +| F1 Score | 1.0 | + +As `mailcom` explicitly detects every digit character, it achieved perfect accuracy, precision, recall, and F1 scores, as expected. We did not compare with `Presidio` for this transformation step since `Presidio` and other open-source pseudonymization tools do not explicitly mask all numbers in a text. + +To reproduce the evaluation results, please refer to the `docs/source/notebooks/quantitative_eval.ipynb` notebook. + +## Future development + +Our future work includes developing a ground-truth annotated dataset and parallelizing the pseudonymization process to further evaluate and improve `mailcom`'s performance. A rough sketch of the parallelization idea is given below.
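+
+As an illustration of that direction, the following minimal sketch parallelizes per-email pseudonymization with the Python standard library. The `pseudonymize_one` helper is hypothetical: it stands in for a full per-email `mailcom` run and here only masks digit characters, loosely mimicking the number detection step evaluated above.
+
+```python
+# Hypothetical sketch: parallelizing per-email pseudonymization across processes.
+# pseudonymize_one is a placeholder for running the mailcom pipeline on one email;
+# here it only masks digits, loosely mimicking the number detection step.
+import re
+from concurrent.futures import ProcessPoolExecutor
+
+def pseudonymize_one(text: str) -> str:
+    return re.sub(r"\d", "x", text)  # mask every digit character
+
+def pseudonymize_all(emails: list[str], max_workers: int = 4) -> list[str]:
+    with ProcessPoolExecutor(max_workers=max_workers) as pool:
+        return list(pool.map(pseudonymize_one, emails))
+
+if __name__ == "__main__":
+    print(pseudonymize_all(["Call me at 12345.", "Room 306, July 15th."]))
+```
\ No newline at end of file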