Unstructured data - Wikipedia

Unstructured data

Unstructured data (or unstructured information) is information that either does not have a pre-defined
data model or is not organized in a pre-defined manner. Unstructured information is typically
text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities
and ambiguities that make it difficult to understand using traditional programs as compared to data
stored in fielded form in databases or annotated (semantically tagged) in documents.

In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an
organization, some estimates run as high as 80%."[1] It's unclear what the source of this number is, but
nonetheless it is accepted by some.[2] Other sources have reported similar or higher percentages of
unstructured data.[3][4][5]

In 2012, IDC and Dell EMC projected that data would grow to 40 zettabytes by 2020, a 50-fold
growth from the beginning of 2010.[6] More recently, IDC and Seagate predicted that the global
datasphere will grow to 163 zettabytes by 2025[7] and that the majority of it will be unstructured.
Computerworld magazine states that unstructured information might account for more than 70–80%
of all data in organizations.[note 1]

Background

The earliest research into business intelligence focused on unstructured textual data, rather than
numerical data.[8] As early as 1958, computer science researchers like H.P. Luhn were particularly
concerned with the extraction and classification of unstructured text.[8] However, only since the turn of
the century has the technology caught up with the research interest. In 2004, the SAS Institute
developed the SAS Text Miner, which uses Singular Value Decomposition (SVD) to reduce a
hyper-dimensional textual space into smaller dimensions for significantly more efficient machine analysis.[9]
The mathematical and technological advances sparked by machine textual analysis prompted a
number of businesses to research applications, leading to the development of fields like sentiment
analysis, voice of the customer mining, and call center optimization.[10] The emergence of Big Data in
the late 2000s led to a heightened interest in the applications of unstructured data analytics in
contemporary fields such as predictive analytics and root cause analysis.[11]
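
The dimensionality reduction mentioned above can be illustrated with a minimal sketch. It is not SAS
Text Miner's implementation; it only assumes that NumPy is available and uses an invented four-document
corpus. The names docs, counts, doc_vectors, and cosine are hypothetical and chosen for this example.

```python
import numpy as np

# Toy corpus (hypothetical): two documents about finance, two about sports.
docs = [
    "stock market price rises",
    "market price falls stock",
    "team wins football match",
    "football match team loses",
]

# Term-document count matrix: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(term) for doc in docs] for term in vocab],
    dtype=float,
)

# Thin SVD: counts = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the k largest singular values; each document becomes a k-dimensional vector.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # shape: (number of documents, k)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(counts.shape, "->", doc_vectors.shape)
print("finance vs finance:", round(cosine(doc_vectors[0], doc_vectors[1]), 3))
print("finance vs sports: ", round(cosine(doc_vectors[0], doc_vectors[2]), 3))
```

In this toy corpus the two finance documents end up with nearly identical reduced vectors, while a
finance document and a sports document, which share no terms, remain close to orthogonal.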

Issues with terminology

The term is imprecise for several reasons:

1. Structure, while not formally defined, can still be implied.

2. Data with some form of structure may still be characterized as unstructured if its structure is not
helpful for the processing task at hand.

3. Unstructured information might have some structure (semi-structured) or even be highly
structured but in ways that are unanticipated or unannounced.

Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide
different methods to find patterns in, or otherwise interpret, this information. Common techniques for
structuring text usually involve manual tagging with metadata or part-of-speech tagging for further
text mining-based structuring. The Unstructured Information Management Architecture (UIMA)
standard provided a common framework for processing this information to extract meaning and
create structured data about the information.
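
As a rough illustration of creating structured data about a piece of text, the sketch below pulls a few
fields out of a free-text note with regular expressions. It is a deliberately simple stand-in for the
tagging techniques named above, not an implementation of UIMA. The note, the patterns, and the function
name extract_fields are all assumptions made for this example.

```python
import re

# Hypothetical free-text note; the wording and values are invented for illustration.
note = (
    "Customer called on 2021-03-14 about invoice 4471. "
    "Refund of 59.90 EUR promised by 2021-03-21."
)

# Simple patterns for three kinds of facts buried in the text.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT = re.compile(r"\b(\d+\.\d{2})\s+(EUR|USD)\b")
INVOICE = re.compile(r"\binvoice\s+(\d+)\b", re.IGNORECASE)


def extract_fields(text):
    """Return a fielded record derived from free text using simple patterns."""
    invoice = INVOICE.search(text)
    return {
        "dates": DATE.findall(text),
        "amounts": [(float(value), currency) for value, currency in AMOUNT.findall(text)],
        "invoice_id": int(invoice.group(1)) if invoice else None,
    }


record = extract_fields(note)
print(record)
```

Patterns like these only work when the text follows predictable conventions, which is one reason
statistical and linguistic methods are used for less regular text.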

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual
structures that exist in all forms of human communication.[12] Algorithms can infer this inherent
structure from text, for instance, by examining word morphology, sentence syntax, and other
small- and large-scale patterns. Unstructured information can then be enriched and tagged to address
ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.

Examples of "unstructured data" may include books, journals, documents, metadata, health records,
audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message,
Web page, or word-processor document. While the main content being conveyed does not have a defined
structure, it generally comes packaged in objects (e.g. files or documents) that themselves have
structure and are thus a mix of structured and unstructured data, but collectively this is still referred to
as "unstructured data".[13] For example, an HTML web page is tagged, but HTML mark-up typically
serves solely for rendering. It does not capture the meaning or function of tagged elements in ways
that support automated processing of the information content of the page. XHTML tagging does
allow machine processing of elements, although it typically does not capture or convey the semantic
meaning of tagged terms.
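
The point about HTML mark-up can be made concrete with a small sketch that uses Python's standard
html.parser module. The page fragment and the class name TagAndTextCollector are invented for this
example; the sketch only pairs each piece of text with the tag that encloses it.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collect (enclosing tag, text) pairs so mark-up and content can be compared."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open tags
        self.pairs = []   # (enclosing tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self._stack[-1] if self._stack else None
            self.pairs.append((enclosing, text))


# Hypothetical page fragment.
page = "<h1>Invoice</h1><p>Total: <b>59.90 EUR</b></p><p>Due: <i>2021-03-21</i></p>"

collector = TagAndTextCollector()
collector.feed(page)
collector.close()
print(collector.pairs)
```

The tags are machine-readable, but b and i only say how the text should be rendered. Nothing in the
mark-up states that one value is an amount and the other a due date, so that meaning still has to be
inferred from the text itself.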

Since unstructured data commonly occurs in electronic documents, the use of a content or document
management system which can categorize entire documents is often preferred over data transfer and
manipulation from within the documents. Document management thus provides the means to convey
structure onto document collections.

Search engines have become popular tools for indexing and searching through such data, especially
text.
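
A common data structure behind text search is the inverted index, which maps each term to the documents
that contain it. The sketch below is a minimal version under simplifying assumptions (lowercased
alphanumeric tokens, boolean AND queries, no ranking). The sample documents and the function names
tokenize, build_index, and search are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical document collection keyed by document id.
docs = {
    1: "Unstructured data is typically text-heavy.",
    2: "Search engines index text documents.",
    3: "Structured data lives in database fields.",
}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(result)


index = build_index(docs)
print(search(index, "data"))
print(search(index, "text documents"))
```

Production search engines add ranking, stemming, and compressed storage on top of this basic idea.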

Approaches in natural language processing

Specific computational workflows have been developed to impose structure upon the unstructured
data contained within text documents. These workflows are generally designed to handle sets of
thousands or even millions of documents, far more than manual approaches to annotation
permit. Several of these approaches are based upon the concept of online analytical processing
(OLAP) and may be supported by data models such as text cubes.[14] Once document metadata is
available through a data model, summaries of subsets of documents (i.e., cells within a
text cube) can be generated with phrase-based approaches.[15]
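
The idea of cells within a text cube can be sketched as follows. This is an illustrative toy, not the
data model or the measures of the cited work: it assumes that each document carries two metadata
dimensions (year and topic) and uses plain term counts as the per-cell measure. The documents and the
function names build_cube and roll_up are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical documents, each with two metadata dimensions and a text body.
docs = [
    {"year": 2017, "topic": "cardiology", "text": "heart failure protein study"},
    {"year": 2017, "topic": "cardiology", "text": "protein markers in heart disease"},
    {"year": 2018, "topic": "cardiology", "text": "heart rhythm study"},
    {"year": 2018, "topic": "oncology", "text": "tumor protein study"},
]


def build_cube(documents, dimensions):
    """Group documents into cells keyed by their dimension values; count terms per cell."""
    cube = defaultdict(Counter)
    for doc in documents:
        cell = tuple(doc[dim] for dim in dimensions)
        cube[cell].update(doc["text"].split())
    return cube


def roll_up(cube, keep):
    """Aggregate cells over the dropped dimensions; keep lists the positions to retain."""
    rolled = defaultdict(Counter)
    for cell, term_counts in cube.items():
        rolled[tuple(cell[i] for i in keep)].update(term_counts)
    return rolled


cube = build_cube(docs, ["year", "topic"])
by_topic = roll_up(cube, keep=[1])

print(cube[(2017, "cardiology")].most_common(2))
print(by_topic[("cardiology",)]["heart"])
```

Once the counts are organized by cell, a summary of any subset of documents reduces to reading or
aggregating the relevant cells rather than rescanning the documents.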

Approaches in medicine and biomedical research

Biomedical research is one major source of unstructured data, as researchers often publish
their findings in scholarly journals. Though it is challenging to derive structural elements from the
language in these documents (e.g., due to the complicated technical vocabulary and the
domain knowledge required to fully contextualize observations), doing so may
yield links between technical and medical studies[16] and clues regarding new disease therapies.[17]
Recent efforts to enforce structure upon biomedical documents include self-organizing map
approaches for identifying topics among documents,[18] general-purpose unsupervised algorithms,[19]
and an application of the CaseOLAP workflow[15] to determine associations between protein names
and cardiovascular disease topics in the literature.[20] CaseOLAP defines phrase-category
relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient
manner. This platform offers enhanced accessibility and empowers the biomedical community with
phrase-mining tools for widespread biomedical research applications.[20]

The use of "unstructured" in data privacy regulations

In Sweden (EU), before 2018, some data privacy regulations did not apply if the data in question was
confirmed as "unstructured".[21] The term "unstructured data" has rarely been used in the EU since
GDPR came into force in 2018. GDPR neither mentions nor defines "unstructured data". It does use
the word "structured" as follows (without defining it):

Parts of GDPR Recital 15: "The protection of natural persons should apply to the processing of
personal data ... if ... contained in a filing system."

GDPR Article 4: "‘filing system’ means any structured set of personal data which are accessible
according to specific criteria ..."

GDPR case law on what defines a "filing system": "the specific criterion and the specific form in
which the set of personal data collected by each of the members who engage in preaching is actually
structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific
person who has been contacted to be easily retrieved, which is however for the referring court to
ascertain in the light of all the circumstances of the case in the main proceedings." (CJEU, Todistajat v.
Tietosuojavaltuutettu, Jehovan, Paragraph 61
(https://curia.europa.eu/juris/document/document.jsf?docid=203822&doclang=EN%7CJehovan)).

If personal data can be easily retrieved, it constitutes a filing system and is therefore in scope for
GDPR, regardless of whether it is labelled "structured" or "unstructured". Most electronic systems
today, subject to access and applied software, can allow for easy retrieval of data.

See also

Clustering

Pattern recognition

List of text mining software

Semi-structured data

Structured data

Notes

1. ^ Today's Challenge in Government: What to do with Unstructured Information and Why Doing
Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov 2010

References

1. Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals"
(https://web.archive.org/web/20110724175845/http://ikt.hia.no/perep/eip_ind.pdf) (PDF). Merrill Lynch.
Archived from the original (http://ikt.hia.no/perep/eip_ind.pdf) (PDF) on 24 July 2011.

2. Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule"
(http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule).
Breakthrough Analysis - Bridgepoints. Clarabridge.

3. Gandomi, Amir; Haider, Murtaza (April 2015). "Beyond the hype: Big data concepts, methods, and
analytics" (https://doi.org/10.1016%2Fj.ijinfomgt.2014.10.007). International Journal of Information
Management. 35 (2): 137–144. doi:10.1016/j.ijinfomgt.2014.10.007
(https://doi.org/10.1016%2Fj.ijinfomgt.2014.10.007). ISSN 0268-4012
(https://www.worldcat.org/issn/0268-4012).

4. "The biggest data challenges that you might not even know you have - Watson"
(https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/). Watson.
2016-05-25. Retrieved 2018-10-02.

5. "Structured vs. Unstructured Data"
(https://www.datamation.com/big-data/structured-vs-unstructured-data.html). www.datamation.com.
Retrieved 2018-10-02.

6. "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's
Data is Analyzed; Less Than 20% is Protected"
(http://www.emc.com/about/news/press/2012/20121211-01.htm). www.emc.com. EMC Corporation.
December 2012.

7. "Trends | Seagate US" (https://www.seagate.com/our-story/data-age-2025/). Seagate.com. Retrieved
2018-10-01.

8. Grimes, Seth. "A Brief History of Text Analytics" (http://www.b-eye-network.com/view/6311). B Eye
Network. Retrieved June 24, 2016.

9. Albright, Russ. "Taming Text with the SVD"
(https://web.archive.org/web/20160930182157/http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf)
(PDF). SAS. Archived from the original
(http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf) (PDF) on 2016-09-30.
Retrieved June 24, 2016.

10. Desai, Manish (2009-08-09). "Applications of Text Analytics"
(http://mybusinessanalytics.blogspot.com/2009/08/applications-of-text-analytics.html). My Business
Analytics @ Blogspot. Retrieved June 24, 2016.

11. Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment
Mining" (https://support.sas.com/resources/papers/proceedings14/1288-2014.pdf) (PDF). SAS.
Retrieved June 24, 2016.

12. "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?"
(http://www.intelligententerprise.com/showArticle.jhtml?articleID=59301538). InformationWeek.
March 1, 2005.

13. Malone, Robert (April 5, 2007). "Structuring Unstructured Data"
(https://www.forbes.com/2007/04/04/teradata-solution-software-biz-logistics-cx_rm_0405data.html).
Forbes.

14. Lin, Cindy Xide; Ding, Bolin; Han, Jiawei; Zhu, Feida; Zhao, Bo (December 2008). "Text Cube: Computing IR
Measures for Multidimensional Text Database Analysis". 2008 Eighth IEEE International Conference on
Data Mining. IEEE. pp. 905–910. CiteSeerX 10.1.1.215.3177
(https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.215.3177). doi:10.1109/icdm.2008.135
(https://doi.org/10.1109%2Ficdm.2008.135). ISBN 9780769535029. S2CID 1522480
(https://api.semanticscholar.org/CorpusID:1522480).

15. Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han,
Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes"
(http://sites.computer.org/debull/A16sept/p74.pdf) (PDF).

16. Collier, Nigel; Nazarenko, Adeline; Baud, Robert; Ruch, Patrick (June 2006). "Recent advances in natural
language processing for biomedical applications". International Journal of Medical Informatics. 75 (6):
413–417. doi:10.1016/j.ijmedinf.2005.06.008 (https://doi.org/10.1016%2Fj.ijmedinf.2005.06.008).
ISSN 1386-5056 (https://www.worldcat.org/issn/1386-5056). PMID 16139564
(https://pubmed.ncbi.nlm.nih.gov/16139564). S2CID 31449783
(https://api.semanticscholar.org/CorpusID:31449783).

17. Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S. (January
2016). "Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4719073). Briefings in Bioinformatics. 17 (1): 33–42.
doi:10.1093/bib/bbv087 (https://doi.org/10.1093%2Fbib%2Fbbv087). ISSN 1477-4054
(https://www.worldcat.org/issn/1477-4054). PMC 4719073
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4719073). PMID 26420781
(https://pubmed.ncbi.nlm.nih.gov/26420781).

18. Skupin, André; Biberstine, Joseph R.; Börner, Katy (2013). "Visualizing the topical structure of the
medical sciences: a self-organizing map approach"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3595294). PLOS ONE. 8 (3): e58779.
Bibcode:2013PLoSO...858779S (https://ui.adsabs.harvard.edu/abs/2013PLoSO...858779S).
doi:10.1371/journal.pone.0058779 (https://doi.org/10.1371%2Fjournal.pone.0058779). ISSN 1932-6203
(https://www.worldcat.org/issn/1932-6203). PMC 3595294
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3595294). PMID 23554924
(https://pubmed.ncbi.nlm.nih.gov/23554924).

19. Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna (2015-04-01). "Unsupervised discovery of
information structure in biomedical documents"
(https://doi.org/10.1093%2Fbioinformatics%2Fbtu758). Bioinformatics. 31 (7): 1084–1092.
doi:10.1093/bioinformatics/btu758 (https://doi.org/10.1093%2Fbioinformatics%2Fbtu758).
ISSN 1367-4811 (https://www.worldcat.org/issn/1367-4811). PMID 25411329
(https://pubmed.ncbi.nlm.nih.gov/25411329).

20. Liem, David A.; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard;
Caufield, John H.; Wang, Wei; Ping, Peipei; Han, Jiawei (Oct 1, 2018). "Phrase mining of textual data to
analyze extracellular matrix protein patterns across cardiovascular disease"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6230912). American Journal of Physiology. Heart and
Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018
(https://doi.org/10.1152%2Fajpheart.00175.2018). ISSN 1522-1539
(https://www.worldcat.org/issn/1522-1539). PMC 6230912
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6230912). PMID 29775406
(https://pubmed.ncbi.nlm.nih.gov/29775406).

21. "Swedish data privacy regulations discontinue separation of "unstructured" and "structured""
(https://sverigeskommunikatorer.se/kunskap/nyheter/gdpr-del-3--missbruksregeln-upphor-vad-innebar-det-for-kommunikatoren/#:~:text=Vad%20inneb%C3%A4r%20Missbruksregeln%3F,men%20%C3%A4ven%20publicering%20av%20bilder).

External links

Matching Unstructured Data and Structured Data (http://www.tdan.com/view-articles/5009)

A brief description for Structured Data
(https://dynomapper.com/blog/21-sitemaps-and-seo/433-what-is-structured-data-for-seo)

Unstructured Data Definition, Examples, Benefits & Challenges
(https://securiti.ai/unstructured-data-101-definition-examples-benefits-challenges/)
