
Fake Data – Rationale, Detection and Implications
Alessandra Rachetti & Wolfhard Wegscheider
Fraudulent "Fake" Data - Facts
About 2 % of researchers have admitted to faking data at
least once in their careers.

Blurred boundaries between innocent error, misunderstandings,
avoidable faults, intentional “bending” and massive falsification.

Fraud implies intention to cheat.

Eurachem Workshop Dublin May 2018 2


Research Misconduct
Official Definition

“fabrication, falsification or plagiarism (FFP) in proposing,
performing or reviewing research or in reporting research results”

US Office of Science and Technology Policy (OSTP)



Fake Data - Characteristics
Falsified, manipulated data: observations that do not fit the
desired results are deleted or amended and the variability as a
whole is reduced.

Fabricated, invented data: very little variation, total absence
of outliers, and because of human intervention, a pattern of digit
preference. Invented distributions tend to be flat, evenly spread
over a limited range.



Fake Data - Objective
Falsification / Bending / Data manipulation: to achieve a
desired result or increase the statistical significance of the findings
and affect the overall scientific conclusions, to achieve publication,
or to produce results confirming a particular theory.

The object of most falsifications is to demonstrate a
“statistically significant” effect that the genuine data
would not show.



Fake Data - Objective
Fabrication / Invention of data for non-existent or incomplete
cases (in clinical studies, market research), usually for financial
gain.

The most serious cases of fraud are those in which there
is an expectation of gain in terms of prestige,
advancement, or money.

Almost never occurs in fields like physics, astronomy and geology.


David Goldstein, 2005
Statistical Methods for Detecting Fake Data
Look at digit distribution and preferences

Look at variances, standard deviations, percentile ranges,
range, kurtosis

Multivariate associations - Look for relationships that should exist
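
The spread and shape statistics listed above can be gathered in one pass. The following is a minimal sketch using only the Python standard library; the function name `dispersion_summary` and the choice of the population formula for excess kurtosis are assumptions made for illustration, not part of the original slides. As a rough guide, normally distributed data give an excess kurtosis near 0, while a flat, evenly spread distribution gives a value near -1.2.

```python
import statistics


def dispersion_summary(values):
    """Return spread and shape statistics for a list of measurements.

    Hypothetical helper for illustration; needs at least two values
    that are not all identical.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)              # sample standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles

    # Population excess kurtosis: m4 / m2**2 - 3.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "variance": sd ** 2,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }
```

Comparing such a summary for a suspect data set against one from comparable genuine data may point to unusually small variance, a narrow range, or a flat shape, none of which is proof of fabrication on its own.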



Digit Preference – First Digit
Benford's Law

▪ Runs against intuition
▪ Mainly for counting and measurement data
▪ Not for assigned data or numbers influenced by human thought
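
Benford's Law states that the leading digit d (1 to 9) occurs with probability log10(1 + 1/d), so a leading 1 appears about 30 % of the time and a leading 9 under 5 %. The sketch below, which is an illustration and not taken from the slides, computes these expected frequencies and a chi-square statistic comparing observed first digits against them. The function names are hypothetical. With 8 degrees of freedom, a statistic above roughly 15.51 would be significant at the 5 % level.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '3.57...e-03'.
    return int(f"{abs(x):.16e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
```

As the slide notes, the test is only meaningful for counting and measurement data that span several orders of magnitude; measurements confined to a narrow range, such as body dimensions, are not expected to follow the law.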



Benford's Law - Examples

from Theodore P. Hill, 1998


Digit Preference – Terminal Digit
Terminal digits are expected to be uniformly distributed, because
they consist mostly of random measurement error.

Humans instinctively exhibit digit preferences.

Well suited to graphical methods of detection:
histogram, stem-and-leaf plot
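
Alongside the graphical methods, the uniformity of terminal digits can be checked numerically. The sketch below is an illustration, not the authors' procedure: it counts the last recorded digit of each value and computes a chi-square statistic against a uniform distribution. The function names and the `decimals` parameter (the number of decimal places to which the data were recorded) are assumptions. With 9 degrees of freedom, a statistic above roughly 16.92 would be significant at the 5 % level.

```python
from collections import Counter


def terminal_digit_counts(values, decimals=1):
    """Count the last recorded digit (0-9) of each measurement.

    `decimals` is the number of decimal places the data were recorded to.
    """
    counts = Counter(int(f"{v:.{decimals}f}"[-1]) for v in values)
    return [counts.get(d, 0) for d in range(10)]


def uniformity_chi_square(counts):
    """Chi-square statistic of digit counts against a uniform distribution."""
    expected = sum(counts) / 10
    return sum((c - expected) ** 2 / expected for c in counts)
```

The list returned by `terminal_digit_counts` can also be plotted directly as the terminal-digit histogram; an excess of 0s and 5s is the typical sign of rounding or digit preference.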



Digit Preference – Stem and Leaf Plot

heights of 351 (elderly) women.

Data source: http://what-when-how.com/statistics/skewness-to-systematic-review-statistics/
Digit Preference – Histogram
[Figure: histogram of frequency (0 to 60) by terminal digit (0 to 9)]

heights of 351 elderly women.



Fake Data / Possum Example

104 mountain brushtail possums


9 morphometric measurements

Head length, skull width

Picture source: http://www.environment.nsw.gov.au/topics/animals-and-plants/native-animals/native-animal-facts/brush-tailed-possum
Data source: Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological
variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae:
Marsupialia). Australian Journal of Zoology 43: 449-458.
https://vincentarelbundock.github.io/Rdatasets/datasets.html
Possum head length / true data
[Figure: histogram with cumulative percentage curve; bins from 82.5 mm to 101.04 mm and greater]
Mean 92.6 mm, St Dev 3.57 mm



Possum head length / fake data
[Figure: histogram with cumulative percentage curve; bins from 82.4 mm to 96.98 mm and greater]
Mean 92.6 mm, St Dev 3.57 mm



Possum head length / true data
terminal digit distribution
[Figure: bar chart of frequency (0 to 20) by terminal digit (0 to 9)]



Possum head length / fake data
terminal digit distribution
[Figure: bar chart of frequency (0 to 20) by terminal digit (0 to 9)]



Possum skull width / true data
[Figure: histogram with cumulative percentage curve; bins from 50 mm to 66.74 mm and greater]
Mean 56.9 mm, St Dev 3.11 mm



Possum skull width / fake data
[Figure: histogram with cumulative percentage curve; bins from 49.2 mm to 60.36 mm and greater]
Mean 56.9 mm, St Dev 3.12 mm



True data: Possum skull width / head length
[Figure: scatter plot of skull width (y-axis, 40 to 75) against head length (x-axis, 80 to 100)]



Fake data: Possum skull width / head length
[Figure: scatter plot of simulated skull width (y-axis, 40 to 75) against simulated head length (x-axis, 80 to 100)]
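
The comparison of the two scatter plots illustrates the "relationships that should exist" check: two measurements taken on the same animal are expected to be correlated, whereas values invented one variable at a time tend not to be. The sketch below is an illustration only. It assumes, without confirmation from the slides, that each fake variable is drawn independently from a normal distribution with the means and standard deviations reported on the histogram slides; the function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_independent(n, seed=0):
    """Draw head length and skull width independently (assumed fake-data model).

    Means and standard deviations (mm) are those shown on the slides.
    """
    rng = random.Random(seed)
    head = [rng.gauss(92.6, 3.57) for _ in range(n)]
    skull = [rng.gauss(56.9, 3.12) for _ in range(n)]
    return head, skull
```

Data simulated this way reproduce each variable's mean and standard deviation, yet `pearson_r` for the pair stays close to zero, which is why univariate summaries alone may fail to reveal the fabrication.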



Fake Data – Risk Factors
Career pressure

"Knowing the answer"

Working in a small team

Working in a field where individual experiments are
not expected to be precisely reproducible
David Goldstein, 2005



Fake Data - What to do?
Increase risk of exposure

Peer review
Full access to original data
Public data repositories
Better education of statisticians
Devote a significant amount of research funds for replications
Automated scanning of publications



Antonakis' 5 scientific diseases
Significosis, an inordinate focus on statistically significant results
Neophilia, an excessive appreciation for novelty
Theorrea, a mania for new theory
Arigorium, a deficiency of rigor in theoretical and empirical work
Disjunctivitis, a proclivity to produce large quantities of redundant, trivial and incoherent works



“If you copy from one author, it’s plagiarism,
but if you copy from many, it’s research.”

Wilson Mizner



Resources
Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and
meta-analysis of survey data. PLoS ONE, 4, e5738.
OSTP Federal Policy on Research Misconduct, cited from Martinson, B. C., Anderson, M. S.,
de Vries, R. (2005). Scientists behaving badly. Nature, 435, 737-738. doi:10.1038/435737a
Fanelli, D. (2013). Redefine misconduct as distorted reporting. Nature, 494, 149.
doi:10.1038/494149a
Hill, T. P. (1998). The first digit phenomenon. American Scientist, 86, 358-363.
Goldstein, D. (2005). Conduct and Misconduct in Science. Retrieved
from http://www.physics.ohio-state.edu/~wilkins/onepage/conduct.html
Evans, S. (2001). Statistical aspects of the detection of fraud. In Lock, S., Wells, F., Farthing,
M. (Eds), Fraud and Misconduct in Biomedical Research (pp. 186-203). London, England: BMJ
Books.
Antonakis, J. (2017). On doing better science: From thrill of discovery to policy implications.
The Leadership Quarterly, 29, 5-21. http://dx.doi.org/10.1016/j.leaqua.2017.01.006

