TXSA Lecture-7-9-2023 PDF
Both data mining and data extraction are processes that involve the retrieval of data, but they
serve different purposes and require different techniques and tools. Here's a breakdown:
Data Mining
1. Purpose: The primary goal is to discover patterns, correlations, and insights within large sets
of data. It's more about deriving new information from data you already have.
2. Scope: Operates on a larger scale, often involving complex datasets like those found in big
data applications.
3. Methods: Utilizes sophisticated algorithms and techniques such as clustering, regression, and
neural networks.
4. Output: The output is often new knowledge—such as patterns or predictions—that can be
actionable or offer a competitive advantage.
5. Examples: Identifying customer segments in retail, predicting disease outbreaks in
healthcare, or detecting fraud in financial transactions.
6. Tools: Software like RapidMiner, IBM SPSS Modeler, and Python libraries like scikit-learn are
commonly used.
Data Extraction
1. Purpose: The primary goal is to retrieve data from a source and move it into a target
database or file for further operations. It's more about collecting raw data for immediate use or
for storing it for future applications.
2. Scope: Can operate on any scale, from extracting a single piece of information to pulling
large sets of data.
3. Methods: Often involves simpler techniques like web scraping, database querying, or use of
APIs.
4. Output: The output is raw data, structured or unstructured, which can be used immediately or
stored for future use.
5. Examples: Pulling sales records from a database, scraping website data, or retrieving sensor
data from IoT devices.
6. Tools: Software like Octoparse and Import.io for web scraping, or SQL for database extraction.
In summary, data mining is more about finding insights from data, while data extraction is
focused on the retrieval of data from various sources. Both are crucial steps in the data lifecycle
but serve different needs and require different skill sets.
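The split between the two steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `sales` table, the customer names, and the cut-off (the mean of the per-customer averages) are all hypothetical, and the "mining" step is a toy segmentation rather than a real clustering algorithm.

```python
import sqlite3
from collections import defaultdict

# Hypothetical in-memory sales table standing in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 400.0), ("ben", 380.0), ("cy", 15.0)],
)

# Data extraction: retrieve the raw rows from the source, unchanged.
rows = conn.execute("SELECT customer, amount FROM sales ORDER BY rowid").fetchall()
conn.close()

# Data mining (toy version): derive something new from the extracted rows,
# here an average spend per customer and a crude two-way segmentation.
amounts = defaultdict(list)
for customer, amount in rows:
    amounts[customer].append(amount)

avg_spend = {c: sum(v) / len(v) for c, v in amounts.items()}
threshold = sum(avg_spend.values()) / len(avg_spend)  # assumed cut-off
segments = {c: ("high" if a > threshold else "low") for c, a in avg_spend.items()}

print(segments)
```

Extraction ends once `rows` holds the raw records; everything after that produces information (the segments) that was not stored in the source.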
Text Mining
1. Purpose: The primary goal is to discover previously unknown information or patterns from
large volumes of text. It's more exploratory in nature.
2. Scope: Generally deals with larger datasets and aims to identify patterns or trends that are
not immediately obvious.
3. Methods: Utilizes Natural Language Processing (NLP), machine learning algorithms, and
statistical models to identify themes, concepts, or patterns.
4. Output: The output is often new knowledge or insights, such as identifying latent topics in
scientific literature or sentiment trends in social media.
5. Examples: Topic modeling in academic papers, sentiment analysis in customer reviews, or
relationship extraction in legal documents.
6. Tools: Software like RapidMiner Text Mining Extension, IBM Watson Natural Language
Understanding, and Python libraries like NLTK or spaCy.
Text Analytics
1. Purpose: The primary goal is to convert unstructured text data into structured data for
analysis, via classification, tagging, or indexing.
2. Scope: Often deals with smaller sets of data and aims to solve more immediate, specific
problems.
3. Methods: Also uses NLP but focuses on simpler tasks like text categorization, sentiment
analysis, or keyword extraction.
4. Output: The output is structured data that can be easily quantified and analyzed, such as
categorizing customer feedback into "positive," "neutral," or "negative."
5. Examples: Spam filtering in emails, customer service automation, or extracting specific
information like dates or names from documents.
6. Tools: Software like TextBlob, MonkeyLearn, or built-in features in data analytics platforms
like Tableau.
In essence, text mining is more concerned with the discovery of hidden patterns or trends in text
data, while text analytics focuses on structuring unstructured text data for easier analysis or
decision-making. Both are subsets of the broader field of NLP and often use similar tools and
techniques, but they are applied for different purposes.
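As a rough sketch of the "unstructured to structured" step described for text analytics, the snippet below pulls dates out of free text with a regular expression. The sample documents and the assumption that dates are written as YYYY-MM-DD are hypothetical; real documents would need a more tolerant pattern or a dedicated NLP library.

```python
import re

# Hypothetical unstructured notes; the YYYY-MM-DD date format is an assumption.
documents = [
    "Invoice sent on 2023-09-07, payment due 2023-10-07.",
    "No dates mentioned here.",
    "Meeting moved to 2023-09-12.",
]

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Text analytics step: turn free text into structured records.
records = [
    {"doc_id": i, "dates": DATE_PATTERN.findall(text)}
    for i, text in enumerate(documents)
]

for record in records:
    print(record)
```

Each record is now a structured row that can be counted, filtered, or loaded into a table.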
Hot take:
AI will not replace humans,
but humans who know how to use AI will replace humans who don't.
Subjective = feelings
Objective = facts
Humans matter as sensors: different people give different opinions from multiple
perspectives, which provides a wider range of views.
### Example:
Here's a simple example of space-delimited lines in a hypothetical file containing names and
ages:
```
John 25
Emily 30
Alice 22
Bob 28
```
In this example, each line represents a record, and within each line, the name and age are
separated by a space.
### Usage:
Space-delimited files are often used for simple data storage or for transferring data between
programs that can parse text files. However, they are less common than other formats like CSV
(Comma-Separated Values) or TSV (Tab-Separated Values) because spaces can appear within
the data fields themselves, causing parsing issues.
### Parsing:
To parse a space-delimited file, you would typically read the file line by line and then split each
line by spaces to get the individual fields.
In Python, for example, you could use the following code to read a space-delimited file:
```python
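# Read the file line by line; each non-empty line holds "name age".
with open('space_delimited_file.txt', 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        name, age = line.split()  # split() with no argument handles repeated whitespace
        print(f"Name: {name}, Age: {age}")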
```
For the example file above, this prints:
```
Name: John, Age: 25
Name: Emily, Age: 30
Name: Alice, Age: 22
Name: Bob, Age: 28
```
Space-delimited lines are straightforward but should be used cautiously, especially if the data
fields can contain spaces themselves.
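The parsing issue is easy to reproduce. In the sketch below, the record `Mary Ann 31` is a made-up example of a name containing a space. One possible workaround, assuming the age is always the last field and never contains a space, is to split only on the last space with `rsplit`:

```python
def parse_record(line):
    """Split a 'name age' line, treating only the LAST space as the delimiter."""
    name, age = line.strip().rsplit(" ", 1)
    return name, int(age)

# A naive split breaks the name into two fields, so "name, age = ..." would fail.
naive = "Mary Ann 31".split(" ")
print(naive)

# Splitting from the right keeps the full name together.
parsed = parse_record("Mary Ann 31")
print(parsed)
```

This only helps when a single trailing field is known to be space-free. If several fields can contain spaces, a format with quoting rules such as CSV is the safer choice.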
Topic mining: Mining data to discover topics and organize the content by them
E.g. Finding posts that speak badly of someone when that was not obvious at first glance
Sentiment analysis: Mining data to find out the feelings of the writer
E.g. Finding the emotion the writer had when writing a post
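A very small sketch of sentiment analysis is shown below. It counts hits against two word lists; the lists are hypothetical and deliberately tiny, and the approach ignores negation ("not good") and sarcasm, so treat it as an illustration of the idea rather than a usable classifier.

```python
import re

# Hypothetical, deliberately tiny word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "angry"}

def sentiment(text):
    """Label text by counting positive and negative lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for post in ["I love this, it is great!", "Terrible service, I hate it", "The parcel arrived on Monday"]:
    print(post, "->", sentiment(post))
```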
### Ambiguity
1. **Lexical Ambiguity**: A single word can have multiple meanings (e.g., "bat" could refer to an
animal or a piece of sports equipment).
2. **Syntactic Ambiguity**: Sentences can be interpreted in multiple ways due to their structure
(e.g., in "I saw the man with the telescope", either the speaker used the telescope or the man had it).
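One common way to approach lexical ambiguity is to look at the surrounding words. The sketch below is a heavily simplified, Lesk-style overlap count, not a full word-sense disambiguation system; the sense inventory for "bat" and its signature words are made up for illustration.

```python
# Hypothetical sense inventory for "bat"; the signature words are illustrative only.
SENSES = {
    "animal": {"cave", "wings", "fly", "night", "nocturnal"},
    "sports equipment": {"ball", "hit", "cricket", "baseball", "swing"},
}

def disambiguate(sentence):
    """Pick the sense whose signature words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    overlaps = {sense: len(words & context) for sense, words in SENSES.items()}
    best = max(overlaps, key=overlaps.get)
    # Return None when no signature word appears, since the context gives no evidence.
    return best if overlaps[best] > 0 else None

print(disambiguate("The bat flew out of the cave at night."))
print(disambiguate("He hit the ball with a bat"))
```

Syntactic ambiguity, as in the telescope sentence, cannot be resolved this way because both readings use exactly the same words.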