Does artificial intelligence deceive?

Poyraz Umut Hatipoğlu
Dec 15, 2022

Introduction & Theory

Have you ever heard of Benford’s law? It may have escaped the attention of even those who are interested in data science, yet it has been in use for almost a century. Today it serves dozens of different purposes, and one of the most interesting is fraud analysis. Although fraud is usually associated with human beings, deception is used frequently in nature by living things at many levels of intelligence and consciousness.

What about today’s sophisticated robots and software equipped with artificial intelligence: do they deceive as living creatures do? Or rather, does an artificial intelligence application that was not designed to commit fraud attempt to commit it anyway? I hope artificial intelligence is not aware of Benford’s law, because today we will test whether the data produced by artificial intelligence is consistent with it.

This law, named after the American electrical engineer and physicist Frank Albert Benford, is also called the “first-digit law” [1], and for years it has been used to examine the first digits of numbers obtained in many fields. A study published in 1995 [2] generalized Benford’s law so that it applies not only to the first digit but also to the digits that follow.

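The probability distribution of the digits under this generalized law is presented in [2] as follows:

$$P(D_1 = d_1, \dots, D_k = d_k) = \log_{10}\left(1 + \left(\sum_{i=1}^{k} d_i \cdot 10^{k-i}\right)^{-1}\right)$$

where $D_i$ is the $i$-th significant digit, $d_1 \in \{1, \dots, 9\}$ and $d_i \in \{0, \dots, 9\}$ for $i \geq 2$.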

The artificial intelligence outputs obtained at Intenseye will be examined using the table presented in [3], which lists the Benford’s law probabilities of each number separately for the first four digits.
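
The tabulated probabilities can also be reproduced from the law itself. The following is a minimal sketch rather than the code used in this study; the function name `benford_digit_prob` is an assumption introduced here for illustration. It relies on the fact that a number starts with the digit block m with probability log10(1 + 1/m), and sums that value over every block that ends with the digit of interest.

```python
import math


def benford_digit_prob(position, digit):
    """Probability that the digit at a 1-based `position` equals `digit`."""
    if position == 1:
        # The leading digit can never be 0.
        return math.log10(1 + 1 / digit) if digit != 0 else 0.0
    # Sum over every possible block of preceding digits (a block cannot start with 0).
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))


for position in range(1, 5):
    digits = range(1, 10) if position == 1 else range(10)
    print(position, [round(benford_digit_prob(position, d), 5) for d in digits])
```

Rounded to five decimals, these values can be compared against the table in [3] as a sanity check before they are hard-coded.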

In [5], the standard deviation of each digit’s proportion is formulated as:

$$s_i = \sqrt{\frac{p_i (1 - p_i)}{n}}$$

where $p_i$ is the expected proportion of a particular digit based on Benford’s law and $n$ is the number of observations. Then the confidence interval can be calculated as follows:

$$p_i \pm \left(z \cdot s_i + \frac{1}{2n}\right)$$

where $z$ denotes the z-statistic, whose value depends on the chosen confidence level, and $1/(2n)$ is a continuity correction term.
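
To get a feel for the size of this interval, here is a small worked sketch that uses only the Python standard library. The function name `benford_margin` and the sample size of 1000 are assumptions chosen for illustration; they are not values from the Intenseye data.

```python
import math
from statistics import NormalDist


def benford_margin(p, n, confidence=0.999):
    """Half-width of the interval around an expected proportion p for n observations."""
    # Two-sided z-statistic for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std = math.sqrt(p * (1 - p) / n)
    # 1 / (2n) is the continuity correction term.
    return z * std + 1 / (2 * n)


# Hypothetical sample size, for illustration only.
margin = benford_margin(p=0.30103, n=1000)
print(f"Accepted deviation for the number 1 in the first digit: +/- {100 * margin:.2f} percentage points")
```

The margin shrinks roughly with the square root of the sample size, which is why small samples can hardly ever be shown to violate the law.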

Conditions & Constraints

There are a few constraints and consistency suggestions that we should pay attention to when examining the different digits of numbers. According to William Goodman, who has published studies on Benford’s law [4], the following conditions should be met as far as possible in order to get consistent and reliable results:

  1. Sufficient sample size: Statistically valid conclusions cannot be reached when the sample is small, for example only tens of observations.
  2. A large span of number values: Samples spanning several orders of magnitude, such as product prices ranging from $1 to tens of thousands of dollars, make Benford’s law implications more reliable.
  3. Positively (“right-”) skewed distributions of numbers: When the numbers have a long right tail, meaning that large values are relatively rare, Benford’s pattern is more likely to appear.
  4. Not human-assigned numbers: Numbers that are merely assigned, such as telephone numbers or postcodes, tend not to exhibit Benford’s patterns.

Intenseye products perform analyses using artificial intelligence techniques for more than 43 use cases on more than 2000 cameras in more than 140 facilities in more than 78 cities. These analyses produce various alert outputs that are presented to the users. By examining the distribution of the generated alerts per camera, we can check whether the outputs of the artificial intelligence-based solution follow a distribution compatible with Benford’s law.

Intenseye has been recording statistics on the alerts produced at many facilities for more than four years. Counting these alerts for each individual camera gives nearly one thousand observations, one alert count per camera, which satisfies the sufficient sample size constraint. However, we must remember that the more observations we have, the better the data can be expected to fit the Benford’s law distributions.

Since the field of view and the monitored area of each camera differ considerably, and the use cases set up for the cameras vary, the alert counts per camera range from 1-digit to 5-digit numbers. Hence we meet the second constraint, a large span of number values.
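
The span condition is easy to verify programmatically. The sketch below is an illustration only: the helper name `magnitude_span` and the sample counts are hypothetical and are not part of the Intenseye data set.

```python
def magnitude_span(values):
    """Return the smallest and largest number of digits among positive integers."""
    lengths = [len(str(v)) for v in values if v > 0]
    return min(lengths), max(lengths)


# Hypothetical alert counts per camera, for illustration only.
alert_counts = [3, 47, 512, 1890, 26041, 7, 133]
print(magnitude_span(alert_counts))  # (1, 5): counts range from 1-digit to 5-digit numbers
```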

Another criterion that needs to be checked for Benford’s law to produce reliable results is whether the distribution of the data is right-skewed. To check this criterion, the histogram of the data can be examined. In addition to the histogram, the skewness of the data can be measured. The results of this examination can be found in the walkthrough section of this blog post. In advance, we can say that the alert counts per camera are highly right-skewed and meet the third constraint.

Since the alert counts produced by Intenseye’s artificial intelligence and computer vision solutions are not human-assigned numbers, the fourth criterion, “not human-assigned numbers”, is also met. We can now examine whether the alert count data is compatible with Benford’s law for different digits. In this blog post, we will examine the first four digits of the per-camera alert counts.
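
As a rough sketch of how the skewness value could be measured, the snippet below computes the Fisher-Pearson coefficient of skewness with plain Python. The helper name `skewness` and the sample counts are assumptions made for illustration; with its default settings, `scipy.stats.skew` computes the same quantity.

```python
def skewness(values):
    """Fisher-Pearson coefficient of skewness (biased): g1 = m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical alert counts per camera, for illustration only.
alert_counts = [3, 7, 12, 47, 133, 512, 1890, 26041]
print(f"Skewness: {skewness(alert_counts):.2f}")
```

A clearly positive value indicates a long right tail, while a value near zero indicates a roughly symmetric distribution.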

Analysis with Code Walkthrough

Python (version 3.8) and its libraries are used for the Benford’s law analyses and visualizations in the code walkthrough section of this blog post.

Before diving into the details of the code-based analysis, it is advisable to create and activate a virtual environment using conda.

conda create -n benford python=3.8
conda activate benford

Then, the following Python packages should be installed with pip:

pip install matplotlib==3.5.1 numpy==1.22.3 openpyxl==3.0.9 pandas==1.4.1 scipy==1.8.0

After the installations, we need to import the required modules and, when working in a Jupyter notebook, enable the inline mode of matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import math
import scipy.stats as st
%matplotlib inline

Since the alert data are stored in a sheet of an xlsx file, we need to read them from the file.

def read_data_from_xlsx(file_name, sheet):
    # Data is read from an Excel file.
    data = pd.read_excel(file_name, sheet)
    return data
benford_data = read_data_from_xlsx('Benfords_Law_Data.xlsx', 'Alert_Count_Per_Camera')

After reading the data, the skewness of the distribution needs to be checked.

def plot_hist(data, field_name):
    # Skewness of the alert count distribution is visualized with histogram plot.
    data.hist(column=field_name, bins=10)
    plt.title('Histogram plot of camera alert counts')
    plt.ylabel('Number of observations')
    plt.xlabel('Alert counts per camera')
    ax = plt.gca()
    ax.axes.xaxis.set_ticks([])
    # Max limit of y-axis is set to 100 to demonstrate each histogram bin properly.
    ax.set_ylim([0, 100])
    plt.show()
plot_hist(benford_data, 'alert_counts')

As seen in the histogram plot, the data is highly right-skewed, as desired. Now let’s get the first digits of the alert counts per camera.

def get_digits(data, data_column, digit_):
    # Integer values are converted to strings so that digits can be selected by position.
    digit_data = data[data_column].astype(str).str[digit_ - 1]
    # The sample size is computed per digit position. Some cameras generate alert counts with 4 digits or more,
    # while others generate fewer alerts; counts that are too short yield NaN and are not counted here.
    # These sizes are needed for the confidence interval calculations.
    size = digit_data.value_counts().sum()
    return digit_data, size
data_digit, sample_size = get_digits(benford_data, 'alert_counts', 1)
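
The effect of the digit position on the sample size can be seen in a plain-Python sketch of the same idea. The helper `digit_at` and the sample counts are hypothetical; in the pandas version above, `.str[...]` yields NaN for numbers that are too short and `value_counts()` ignores those entries by default.

```python
def digit_at(value, position):
    """Return the digit (as a string) at a 1-based position, or None if the number is too short."""
    text = str(value)
    return text[position - 1] if len(text) >= position else None


# Hypothetical alert counts per camera, for illustration only.
alert_counts = [3, 47, 512, 1890, 26041]
for position in range(1, 5):
    digits = [digit_at(v, position) for v in alert_counts]
    available = [d for d in digits if d is not None]
    print(position, digits, "sample size:", len(available))
```

Because fewer cameras reach four-digit alert counts, the sample size for the higher digit positions is smaller, and the accepted deviation range for them is correspondingly wider.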

After obtaining the first digits and the sample size, we can plot the observed distribution alongside the Benford’s law probabilities. But first, we need to define the Benford’s law probabilities for all four digits so that they can be plotted together with the actual distributions.

# Probabilities predicted by Benford's Law for the first and higher-order digits, from Nigrini 1996 [3].
# The first-digit list covers the numbers 1-9; the other lists cover the numbers 0-9.
BENFORD_PROBS = {
    '1': [0.30103, 0.17609, 0.12494, 0.09691, 0.07918, 0.06695, 0.05799, 0.05115, 0.04576],
    '2': [0.11968, 0.11389, 0.10882, 0.10433, 0.10031, 0.09668, 0.09337, 0.09035, 0.08757, 0.08500],
    '3': [0.10178, 0.10138, 0.10097, 0.10057, 0.10018, 0.09979, 0.09940, 0.09902, 0.09864, 0.09827],
    '4': [0.10018, 0.10014, 0.10010, 0.10006, 0.10002, 0.09998, 0.09994, 0.09990, 0.09986, 0.09982]
}
def plot_digit_dist(data, digit_name):
    bar_width = 1.0 / 3
    expected = BENFORD_PROBS[digit_name]
    # The first digit takes the numbers 1-9; later digits take the numbers 0-9.
    labels = [str(d) for d in range(10 - len(expected), 10)]
    # Counts are normalized to proportions. A number that never occurs gets a proportion of 0,
    # so the observed and expected bars always have the same length.
    counted_vals_norm = data.value_counts(normalize=True).reindex(labels, fill_value=0)
    index = np.arange(len(labels))

    fig, ax = plt.subplots()
    ax.bar(index, counted_vals_norm, bar_width, label="Actual distribution")
    ax.bar(index + bar_width, expected, bar_width, label="Benford's distribution")

    ax.set_xlabel('Digit Numbers')
    ax.set_ylabel('Distribution percentages')
    ax.set_title('Distributions of numbers in the digit: ' + digit_name)
    ax.set_xticks(index + bar_width / 2)
    ax.set_xticklabels(labels)
    # Max limit of y-axis is set to 0.4 (40%) to show bar charts properly.
    ax.set_ylim([0, 0.40])
    ticks_loc = ax.get_yticks().tolist()
    ax.yaxis.set_major_locator(mticker.FixedLocator(ticks_loc))
    # Tick positions are proportions, so they are multiplied by 100 to label them as percentages.
    ax.set_yticklabels(['{:.1f}%'.format(100 * tick) for tick in ticks_loc])
    ax.legend()
    plt.show()

Let’s plot the bar chart showing the distributions for the first digit.

plot_digit_dist(data_digit, str(1))

As you can see, the actual distribution and the Benford’s law probabilities show similar characteristics, with some deviations. We therefore need to check whether these deviations are within acceptable limits. To do so, we first measure how far the distribution of the first digits of the alert counts deviates from the ideal Benford’s law probabilities.

def calc_digit_deviations(data, digit_name):
    # Deviations from the Benford's dist. for actual dist. are calculated for each number.
    init_num = 0
    deviations = []
    if digit_name == '1':
        # For the first digit we need to skip '0'.
        init_num = 1
    for idx, prob in enumerate(BENFORD_PROBS[digit_name]):
        digit_index = str(idx + init_num)
        if digit_index in data.index:
            deviations.append(prob - data[digit_index])
        else:
            deviations.append(prob)
    return deviations
def calc_abs_deviation(data, digit_name):
    counted_vals = data.value_counts()
    # Number of occurrences for each number is calculated.
    counted_vals_norm = counted_vals.div(counted_vals.sum())
    # Occurrences are normalized and deviations from Benford dist. are calculated.
    deviations = calc_digit_deviations(counted_vals_norm, digit_name)
    # Absolute differences and the maximum value of it are measured.
    abs_dev = [abs(ele * 100) for ele in deviations]
    max_dev = max(abs_dev)
    return abs_dev, max_dev
abs_dev, max_dev_first = calc_abs_deviation(data_digit, str(1))
print(f"Deviations: {abs_dev}") # Deviations from the Benford's dist for each number in the first digit.
Deviations: [2.9111321473951737, 1.0695260482846247, 1.3560635324015247, 0.7283138500635333, 1.0565006353240147, 1.3582782719186781, 1.605861499364676, 2.12769377382465, 1.650175349428208]

As seen, the deviations vary between 0.73% and 2.91% for the different numbers in the first digit. Now let’s calculate the acceptable deviation range for each number in the first digit. We chose a 99.9% confidence level for the confidence interval analysis.

CONFIDENCE_LEVEL = .999  # A 99.9% confidence level is used for the confidence interval checks
def calc_confidence_interval(size, digit_name):
    # The two-sided z-score for the given confidence level is calculated for the interval half-width.
    alpha = 1 - CONFIDENCE_LEVEL
    z_score = st.norm.ppf(1 - alpha / 2)
    accepted_devs = []
    for prob in BENFORD_PROBS[digit_name]:
        # The standard deviation and the accepted range follow "The effective use of Benford’s law to assist in
        # detecting fraud in accounting data" by Durtschi et al. 2004; 1 / (2 * size) is the continuity correction.
        std_digit = math.sqrt(prob * (1 - prob) / size)
        # The accepted deviation is expressed in percentage points.
        accepted_dev = (z_score * std_digit + 1 / (2 * size)) * 100
        accepted_devs.append(accepted_dev)
    return accepted_devs
accepted_dev = calc_confidence_interval(sample_size, str(1))
print(f"Accepted deviations: {accepted_dev}")  # Accepted deviations from Benford's law for each number in the first digit.

Now that we have both the deviations and the acceptable limits, we can check whether the distribution violates Benford’s law by comparing them.

def check_inside_conf_int(ranges, diffs):
    ranges = np.array(ranges)
    diffs = np.array(diffs)
    return ((ranges - diffs) > 0).all()
are_inside = check_inside_conf_int(accepted_dev, abs_dev)
print(f"Are they inside the accepted limits?: {are_inside}") # Whether distributions of numbers stay inside the confidence interval for the first digit.
Are they inside the accepted limits?: True

For the first digit, the alert counts per camera fit Benford’s law. Now we can apply the same procedure to the second, third, and fourth digits.

For the second digit:

data_digit, sample_size = get_digits(benford_data, 'alert_counts', 2)
plot_digit_dist(data_digit, str(2))
abs_dev, max_dev_second = calc_abs_deviation(data_digit, str(2))
accepted_dev = calc_confidence_interval(sample_size, str(2))
are_inside = check_inside_conf_int(accepted_dev, abs_dev)
print(f"Are they inside the accepted limits?: {are_inside}") # Whether distributions of numbers stay inside the confidence interval for the second digit.
Are they inside the accepted limits?: True

For the third digit:

data_digit, sample_size = get_digits(benford_data, 'alert_counts', 3)
plot_digit_dist(data_digit, str(3))
abs_dev, max_dev_third = calc_abs_deviation(data_digit, str(3))
accepted_dev = calc_confidence_interval(sample_size, str(3))
are_inside = check_inside_conf_int(accepted_dev, abs_dev)
print(f"Are they inside the accepted limits?: {are_inside}") # Whether distributions of numbers stay inside the confidence interval for the third digit.
Are they inside the accepted limits?: True

For the fourth digit:

data_digit, sample_size = get_digits(benford_data, 'alert_counts', 4)
plot_digit_dist(data_digit, str(4))
abs_dev, max_dev_fourth = calc_abs_deviation(data_digit, str(4))
accepted_dev = calc_confidence_interval(sample_size, str(4))
are_inside = check_inside_conf_int(accepted_dev, abs_dev)
print(f"Are they inside the accepted limits?: {are_inside}") # Whether distributions of numbers stay inside the confidence interval for the fourth digit.
Are they inside the accepted limits?: True

As seen, the distributions for all four digits lie within the acceptable limits of the Benford’s law probabilities. Let’s check the maximum deviation over all numbers in all digits.
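
The four per-digit runs repeat the same steps, so they could also be folded into a single loop. The following self-contained sketch shows one way to do that with the standard library only. It is not the code used for the results in this post: the helper names `expected_prob` and `check_position` are assumptions, the expected probabilities are computed from the law instead of being read from the table, and the data is synthetic rather than the Intenseye alert counts.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def expected_prob(position, digit):
    """Benford probability of `digit` at a 1-based `position`."""
    if position == 1:
        return math.log10(1 + 1 / digit)
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))


def check_position(values, position, confidence=0.999):
    """Return (maximum absolute deviation, all-within-limits flag) for one digit position."""
    # Numbers that are too short to have this digit are skipped.
    digits = [str(v)[position - 1] for v in values if len(str(v)) >= position]
    n = len(digits)
    counts = Counter(digits)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    max_dev, inside = 0.0, True
    for d in (range(1, 10) if position == 1 else range(10)):
        p = expected_prob(position, d)
        dev = abs(counts.get(str(d), 0) / n - p)
        margin = z * math.sqrt(p * (1 - p) / n) + 1 / (2 * n)
        max_dev = max(max_dev, dev)
        inside = inside and dev < margin
    return max_dev, inside


random.seed(0)
# Synthetic, log-uniformly distributed counts; NOT the Intenseye data.
synthetic = [int(10 ** random.uniform(0, 5)) for _ in range(1000)]
for position in range(1, 5):
    max_dev, inside = check_position(synthetic, position)
    print(f"digit {position}: max deviation {100 * max_dev:.2f}%, inside limits: {inside}")
```

Keeping the per-digit logic in one function makes it harder for the four runs to drift apart, for example by pairing a digit position with the wrong sample size.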

max_dev_all_digits = max(max_dev_first, max_dev_second, max_dev_third, max_dev_fourth)
print(f"Maximum deviation for all numbers in all digits: {max_dev_all_digits}") # Maximum deviation in percentage for all numbers in all digits
Maximum deviation for all numbers in all digits: 2.9111321473951737

The maximum deviation is less than 3% for all numbers in all digits.

In this blog post, by examining the alert counts produced per camera by the computer vision and artificial intelligence solutions used at Intenseye, we have shown that the digit distributions comply with Benford’s law. The compliance of the data with Benford’s law is also expected to improve as the number of observations grows. Hence, we have verified that artificial intelligence applications that are not designed to commit fraud do not commit it, at least the ones produced at Intenseye.

References

[1] Benford’s law. Wikipedia. https://en.wikipedia.org/wiki/Benford%27s_law

[2] Hill, T. (1995). A statistical derivation of the significant-digit law. Statistical Science, 10(4). doi:10.1214/ss/1177009869

[3] Nigrini, M. J. (1996). A taxpayer compliance application of Benford’s law. The Journal of the American Taxation Association, 18(1), 72.

[4] Goodman, W. M. (2013). Reality checks for a distributional assumption: The case of “Benford’s Law”. Joint Statistical Meeting, Business and Economic Statistics Section.

[5] Durtschi, C., Hillison, W., & Pacini, C. (2004). The effective use of Benford’s law to assist in detecting fraud in accounting data. Journal of Forensic Accounting, 5(1), 17-34.
