
ACADEMIA Letters

Dimensionality Reduction in EH&S Data Analysis


Jon Judge, Illinois State University

Multiple occupational sectors have successfully employed tools from big data analytics, which
deals with large, complex data sets characterized by volume, variety, velocity, veracity, value,
and complexity, to assist in problem-solving and decision-making. Big data includes unstructured
(text-heavy and unorganized) and multi-structured data (including human-machine interactions).
As the volume of data generated continues to grow, the sheer quantity becomes challenging to
handle from the standpoints of analysis, storage, and the sustainability of the data.
One such tool is dimensionality reduction, which refers to the transformation of data
from a high-dimensional space to a lower-dimensional space that retains some meaningful
properties of the original data (Van Der Maaten, Postma & Van den Herik, 2009). Stated
more simply, it is a method of simplifying data to extract as much useful information from as
little data as appropriate. In the field of big data analytics, dimensionality reduction is
performed after data has been collected, and uses a variety of mathematical and statistical
methods to determine which data to keep and which to disregard as irrelevant. Whatever the
technique used, the goal is the same: reducing a vast amount of data into something more
manageable. The principles of dimensionality reduction can also be applied ahead of and during
data collection, allowing for a pragmatic approach to collecting data that serves the needs of
the individual collecting or analyzing it, with the hope of making the process of collecting
EH&S data and making decisions from it more efficient.
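One simple way to decide which data to keep and which to disregard is to drop a variable that
carries nearly the same information as one already kept. The Python sketch below illustrates
that idea only; it is not a method prescribed in this letter. The variable names, the sample
values, and the 0.95 correlation threshold are all assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treated as 0 here
    return cov / sqrt(var_x * var_y)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)

# Hypothetical incident records: one value per incident, per variable.
records = {
    "age": [25, 32, 41, 47, 52, 58],
    "years_in_job": [2, 8, 18, 23, 29, 34],  # rises almost in lockstep with age
    "month": [1, 7, 12, 2, 6, 11],
}

print(drop_redundant(records))
```

In this made-up data set, time in the job tracks age so closely that it adds little, so only
one of the two is kept. Which one to keep, and what threshold counts as "redundant", remain
judgment calls for the practitioner.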
To illustrate the application of dimensionality reduction principles, let us examine a case
study of a laboratory facility with a varied set of operations trying to assess the prevalence and
scope of incidents and injuries. This facility has been collecting data for a period of 10 years,
and in that time has used an instrument or form to collect information about each incident
that occurs. The information collected is then used, along with a series of administrative
actions, before finally ending up as ‘data’ to be analyzed by a safety and health practitioner at
some future point in time to help make decisions about training opportunities/needs, needs
assessments, and hazard analyses.

Academia Letters, November 2021. ©2021 by the author. Open Access. Distributed under CC BY 4.0.

Corresponding Author: Jon Judge, jonjudge@gmail.com

Citation: Judge, J. (2021). Dimensionality Reduction in EH&S Data Analysis. Academia Letters, Article 4109.
https://doi.org/10.20935/AL4109.

While the instrument used by the practitioner collects, or attempts to collect, information
leading to some sort of resolution of the incident in terms of a ‘root cause analysis’, it also
collects information of a demographic nature: age, gender, job title, time in a job, job section,
and recent completion dates of training or certification, as examples. These can be significant
to the incidence rate of injuries or near-misses in the facility, but they might not be. The field
of statistics offers a selection of tools to help determine what is or is not statistically
significant, and if the analysis of the data requires that level of precision, there are many
options available.
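As one example of such a tool, a chi-square test of independence can indicate whether a
demographic variable and the occurrence of an incident appear to be related. The sketch below
is a minimal Python version for a 2x2 table, without a continuity correction. The counts and
group labels are hypothetical; the critical value 3.841 is the standard chi-square threshold
for one degree of freedom at a 0.05 significance level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05

# Hypothetical counts per group: (with incident, without incident) for each of two groups.
tables = {
    "age 40 and over vs. under 40": (30, 70, 15, 85),
    "gender (group A vs. group B)": (22, 78, 23, 77),
}

for label, counts in tables.items():
    stat = chi_square_2x2(*counts)
    verdict = "worth a closer look" if stat > CRITICAL_VALUE else "no clear relationship"
    print(f"{label}: chi-square = {stat:.2f} ({verdict})")
```

A statistic above the critical value suggests the variable may be worth retaining in the
instrument; one well below it marks a candidate for removal, subject to the practitioner's
judgment.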
In our example, the practitioner is looking over the data in a spreadsheet, highlighting
groups of cells, and looking over the charts that are drawn based on the data. While many
of these comparisons are unrevealing, others appear to highlight or illustrate some sort of
relationship between a variable or group of variables (the demographic information collected)
and the result (the incident or near-miss).
The analysis, largely non-mathematical and more inductive/deductive than calculated, reveals
that certain times of the year, certain job types, and certain age demographics all appear to
have ‘some sort’ of relationship with the incidence rate of accidents/near-misses: more ‘slips,
trips and falls’ during the colder months, when ground surface conditions are more variable;
more ‘technicians’ having accidents and near-misses than other job types; and workers over the
age of 40-45 having more incidents than other age demographics.
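The informal comparisons described here amount to tallying incidents by one variable at a
time. A minimal Python sketch of that tallying follows. The incident log, its field names, the
set of months treated as "cold", and the age cutoff of 40 are all hypothetical.

```python
from collections import Counter

# Hypothetical incident log; every field name and value here is illustrative.
incidents = [
    {"month": 1, "job": "technician", "age": 47},
    {"month": 2, "job": "technician", "age": 52},
    {"month": 7, "job": "researcher", "age": 31},
    {"month": 12, "job": "technician", "age": 44},
    {"month": 1, "job": "custodian", "age": 58},
    {"month": 6, "job": "technician", "age": 29},
]

COLD_MONTHS = {11, 12, 1, 2, 3}  # assumed definition of "colder months"

def season(record):
    return "cold" if record["month"] in COLD_MONTHS else "warm"

def age_band(record):
    return "40 and over" if record["age"] >= 40 else "under 40"

def tally(records, key):
    """Count incidents per category, where key maps a record to its category."""
    return Counter(key(record) for record in records)

by_season = tally(incidents, season)
by_job = tally(incidents, lambda record: record["job"])
by_age = tally(incidents, age_band)

for name, counts in [("season", by_season), ("job", by_job), ("age", by_age)]:
    print(name, dict(counts))
```

Raw counts like these only hint at a relationship. Turning them into incidence rates would
also require the number of workers in each category, which this sketch does not model.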
For other demographics, there appears to be minimal or no relationship, e.g., an ‘equal’
number of incidents across genders. While this information is revealing, it might not be
sufficient to resolve the issue the practitioner is trying to address. Collecting more information
(data) might assist in illuminating the issue. However, when ‘too much’ data is collected and
the resources available to the person trying to analyze it are insufficient, we have a problem
of intractability (Hopcroft, Motwani & Ullman, 2001).
Consider an instrument such as a survey form for overall health, safety, and environmental
concerns. Such an instrument might contain upwards of one hundred distinct ‘points’ for
the inspector to address in the inspection. If we were to apply dimensionality reduction to
this instrument, we would be trying to streamline the one-hundred-question instrument to
something with fewer ‘points’ that still retains the original instrument’s ability to adequately
and satisfactorily capture the intended data. If an inspection form is adequate, it should be
able to capture data regarding events and situations that fall outside the boundaries of what is
deemed ‘acceptable’ for each measurement (e.g., if hazardous chemicals are being disposed of
down sinks not intended for waste, the inspection form should be able to capture that data). A
dimensionally reduced form should be able to capture the same data, with the same confidence,
yet using fewer ‘points’ (variables).
Reducing a large data set into something more manageable for analysis is not a new concept
to anyone who has had to perform statistical analysis, but what of applying the principles
prior to data collection? In our example, the practitioner has 10 years of data to use
as historical background to inform their decision-making, should they choose to move forward
with the modification or creation of a ‘new’ instrument. With this amount of information,
the practitioner can examine the data to see what they need to capture in the instrument.
If an existing form or procedure is capturing data that never changes significantly, is that
data necessary to collect? Is it necessary to keep? Is it useful? How much time does it take to
collect? Is there some sort of benefit, tangible or not, to acquiring and handling that data?
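The first of these questions can be checked against historical records. The Python sketch below
flags checklist items whose answers almost never change. The item names, the inspection history,
and the 98% cutoff are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter

def near_constant_items(responses, min_share=0.98):
    """Return items whose most common answer makes up at least min_share of responses."""
    flagged = []
    for item, answers in responses.items():
        most_common_count = Counter(answers).most_common(1)[0][1]
        if most_common_count / len(answers) >= min_share:
            flagged.append(item)
    return flagged

# Hypothetical inspection history: one answer per inspection, per checklist item.
history = {
    "eyewash_station_present": ["pass"] * 40,
    "waste_containers_labeled": ["pass"] * 31 + ["fail"] * 9,
    "fume_hood_certified": ["pass"] * 39 + ["fail"] * 1,
}

print(near_constant_items(history))
```

A flagged item is a candidate for review, not for automatic removal. The remaining questions
about usefulness, collection time, and other benefits still have to be answered by the
practitioner.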
What we seek is the intrinsic dimension: the smallest number of parameters required to model
the data without loss. For our example, we can restate this as the smallest number of questions
our laboratory inspection form can have and still reliably and adequately assess the health,
safety, and environmental performance of our labs.
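As a toy illustration of this restatement, the Python sketch below searches for a smallest set
of questions that still flags every past inspection the full form flagged. This is a deliberate
simplification: "without loss" is reduced here to "no failed inspection goes unnoticed", which
is far weaker than the mathematical notion of intrinsic dimension. The question labels and
inspection records are hypothetical, and the brute-force search is only practical for small
forms.

```python
from itertools import combinations

def smallest_equivalent_subset(inspections):
    """Find a smallest set of items that flags every inspection the full form flags.

    Each inspection is the set of item names that failed; an empty set means the
    inspection passed. Subsets are tried in order of increasing size, so the first
    match found is one of the smallest possible.
    """
    items = sorted(set().union(*inspections))
    failed = [inspection for inspection in inspections if inspection]
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            chosen = set(subset)
            # Every failed inspection must share at least one item with the subset.
            if all(chosen & inspection for inspection in failed):
                return subset
    return tuple(items)

# Hypothetical history: the questions that failed in each of six inspections.
past_inspections = [
    {"Q1", "Q2"},
    {"Q2", "Q3"},
    {"Q3"},
    set(),  # a clean inspection
    {"Q2", "Q4"},
    {"Q1", "Q3"},
]

print(smallest_equivalent_subset(past_inspections))
```

In this example two questions are enough to catch every inspection that had a problem, though
the reduced form would no longer record which specific problems occurred.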
We must consider, however, the ability of our instrument to measure what it is intended
to measure. Continuing with our example, if a laboratory has consistently performed ‘well’,
in that it passes inspections with few to no demerits, yet has safety- and health-related
issues during non-inspection times (for instance, dumping of toxic chemicals down a
sink drain), can we say that the instrument we are using to assess the safety and health of the
laboratory is adequate in its measurement? Is the tool too limited in scope from the standpoint
of frequency of measurement, for example annual inspections versus quarterly?
Consider too those facilities that are held to some sort of measured or calculated performance,
quality, reliability, or safety metric derived from sources such as incidence rates, error
rates, turn-around time, proficiency at a particular function or task, etc. The result, the
metric, is a function of the variables used to derive it. In this case the source might be an
assessment by a third party, a standardized metric that is not altered, or another assessment
for which it might not be appropriate or possible to dimensionally reduce the data acquisition
tools. However, there exist opportunities to use the data generated by these tools in a
pragmatic, dimensionally reduced manner to assist with making determinations from the data and
planning what to do with the results. This is a situation of asking the question, “are full data
sets (inspections, surveys, etc.) necessary to make decisions from, or can the process be
simplified (dimensionally reduced) and still retain an acceptable level of error (uncertainty)?”
In this example, the dimensionality reduction would happen after the data is collected, rather
than the data being collected from a dimensionally reduced instrument.
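One way to put a number on "an acceptable level of error" is to compute the same metric from the
full data set and from a reduced one, then compare the two. The Python sketch below does this
for a simple compliance score. The inspection results, the choice of reduced items, and the 0.15
tolerance are all hypothetical, and the tolerance in particular is a judgment call for the
practitioner.

```python
def compliance_score(inspection, items):
    """Share of the given items that passed (1 = pass, 0 = fail) in one inspection."""
    return sum(inspection[item] for item in items) / len(items)

def max_score_gap(inspections, reduced_items):
    """Largest absolute difference between full-form and reduced-form scores."""
    gaps = []
    for inspection in inspections:
        full = compliance_score(inspection, list(inspection))
        reduced = compliance_score(inspection, reduced_items)
        gaps.append(abs(full - reduced))
    return max(gaps)

# Hypothetical results for three labs on a five-item form (1 = pass, 0 = fail).
inspections = [
    {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 1, "Q5": 1},
    {"Q1": 1, "Q2": 0, "Q3": 1, "Q4": 1, "Q5": 0},
    {"Q1": 0, "Q2": 0, "Q3": 1, "Q4": 0, "Q5": 1},
]
TOLERANCE = 0.15  # assumed acceptable error, not a recommended value

gap = max_score_gap(inspections, reduced_items=["Q2", "Q3", "Q4"])
print(f"largest gap: {gap:.3f}; within tolerance: {gap <= TOLERANCE}")
```

If the largest gap stays within the chosen tolerance across the historical record, the reduced
set may be adequate for decision-making even though the full data set was collected.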
If we consider dimensionality reduction to be the process of taking a complex equation
or procedure with multiple dimensions or degrees of freedom and eliminating, in a pragmatic
manner, those variables that are not necessary or not significant to the result or the intent
of the iteration of the equation, then this process can be applied in a largely non-mathematical
manner to assist practitioners in the acquisition, analysis, and decision-making process. This
approach takes what is commonly referenced in mathematical and statistical circles and makes
it available as a principle or methodology that can be employed on both ends of data acquisition
and handling.
When collecting information from a system, the challenge is how to use that collected
information initially, and then to determine which of that information to retain for use in the
future, and in what form. Questions arise such as “are full data sets (inspections, surveys, etc.)
necessary to make decisions from, or can the process be simplified (dimensionally reduced)
and still retain an acceptable level of error (uncertainty)?”, as well as “If an instrument
used to collect data is to be streamlined and simplified, how is that done?”

References
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2001). Introduction to automata theory,
languages, and computation. ACM SIGACT News, 32(1), 60-65.

Van Der Maaten, L., Postma, E., & Van den Herik, J. (2009). Dimensionality reduction: a
comparative review. J Mach Learn Res, 10(66-71), 13.
