TXSA Lecture 7-9-2023


Exam is from slides, not lab

A mock exam will be given, which also tests the lockdown browser

Data analytics vs data analysis:


- Data analytics includes ML and algorithms (prediction, etc.)
- Data analysis can be performed by anyone

Data mining vs data extraction


- Data extraction: retrieving information
- Data mining: digging out (discovering) information

Both data mining and data extraction are processes that involve the retrieval of data, but they
serve different purposes and require different techniques and tools. Here's a breakdown:

Data Mining
1. Purpose: The primary goal is to discover patterns, correlations, and insights within large sets
of data. It's more about deriving new information from data you already have.
2. Scope: Operates on a larger scale, often involving complex datasets like those found in big
data applications.
3. Methods: Utilizes sophisticated algorithms and techniques such as clustering, regression, and
neural networks.
4. Output: The output is often new knowledge—such as patterns or predictions—that can be
actionable or offer a competitive advantage.
5. Examples: Identifying customer segments in retail, predicting disease outbreaks in
healthcare, or detecting fraud in financial transactions.
6. Tools: Software like RapidMiner, IBM SPSS Modeler, and Python libraries like scikit-learn are
commonly used.
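
As a small illustration of the methods point above, here is a minimal clustering sketch with scikit-learn; the feature values (spend and visit frequency) are invented purely for illustration, not taken from the lecture.

```python
# Minimal k-means clustering sketch with scikit-learn (toy data only).
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
X = np.array([
    [200, 1], [220, 2], [250, 1],      # low-spend, infrequent customers
    [1500, 8], [1600, 10], [1400, 9],  # high-spend, frequent customers
])

# Group customers into two segments -- the "new knowledge" here is the segment labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]
```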

Data Extraction
1. Purpose: The primary goal is to retrieve data from a source and move it into a target
database or file for further operations. It's more about collecting raw data for immediate use or
for storing it for future applications.
2. Scope: Can operate on any scale, from extracting a single piece of information to pulling
large sets of data.
3. Methods: Often involves simpler techniques like web scraping, database querying, or use of
APIs.
4. Output: The output is raw data, structured or unstructured, which can be used immediately or
stored for future use.
5. Examples: Pulling sales records from a database, scraping website data, or retrieving sensor
data from IoT devices.
6. Tools: Software like Octoparse, Import.io for web scraping, or SQL for database extraction.
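
To make the contrast concrete, here is a minimal extraction sketch using Python's built-in sqlite3 module; the `sales.db` file, the `sales` table, and its columns are hypothetical stand-ins.

```python
# Minimal data-extraction sketch: pull raw records out of a database with SQL.
import sqlite3

conn = sqlite3.connect('sales.db')  # hypothetical database file
cursor = conn.execute(
    "SELECT order_id, amount, order_date FROM sales WHERE order_date >= '2023-01-01'"
)

# The output is just raw rows -- no patterns or predictions, only retrieval.
for order_id, amount, order_date in cursor:
    print(order_id, amount, order_date)

conn.close()
```
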
In summary, data mining is more about finding insights from data, while data extraction is
focused on the retrieval of data from various sources. Both are crucial steps in the data lifecycle
but serve different needs and require different skill sets.

Text Mining vs Text Analytics


Text mining and text analytics are closely related fields that both deal with the analysis of
unstructured text data, but they differ in scope, methods, and objectives. Here's a breakdown to
clarify the distinctions:

Text Mining
1. Purpose: The primary goal is to discover previously unknown information or patterns from
large volumes of text. It's more exploratory in nature.
2. Scope: Generally deals with larger datasets and aims to identify patterns or trends that are
not immediately obvious.
3. Methods: Utilizes Natural Language Processing (NLP), machine learning algorithms, and
statistical models to identify themes, concepts, or patterns.
4. Output: The output is often new knowledge or insights, such as identifying latent topics in
scientific literature or sentiment trends in social media.
5. Examples: Topic modeling in academic papers, sentiment analysis in customer reviews, or
relationship extraction in legal documents.
6. Tools: Software like RapidMiner Text Mining Extension, IBM Watson Natural Language
Understanding, and Python libraries like NLTK or spaCy.
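
As a rough sketch of the topic-modeling example mentioned above, here is a minimal LDA run with scikit-learn; the four documents are made up for illustration and the two-topic setting is an assumption.

```python
# Minimal topic-modeling sketch with scikit-learn's LDA (toy documents).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "battery life of the phone is great",
    "the phone battery drains too fast",
    "delivery was late and the courier was rude",
    "fast delivery and friendly courier",
]

# Turn the raw text into a document-term matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Fit two latent topics over the corpus
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-3:]]
    print(f"Topic {idx}: {top_words}")
```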

Text Analytics
1. Purpose: The primary goal is to convert unstructured text data into structured data for
analysis, via classification, tagging, or indexing.
2. Scope: Often deals with smaller sets of data and aims to solve more immediate, specific
problems.
3. Methods: Also uses NLP but focuses on simpler tasks like text categorization, sentiment
analysis, or keyword extraction.
4. Output: The output is structured data that can be easily quantified and analyzed, such as
categorizing customer feedback into "positive," "neutral," or "negative."
5. Examples: Spam filtering in emails, customer service automation, or extracting specific
information like dates or names from documents.
6. Tools: Software like TextBlob, MonkeyLearn, or built-in features in data analytics platforms
like Tableau.
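
Since TextBlob is named above, a minimal sketch of turning free-text feedback into the structured positive/neutral/negative labels described in the output point might look like this; the example sentences and the polarity cutoffs are assumptions, not part of the lecture.

```python
# Minimal text-analytics sketch: label feedback as positive / neutral / negative.
from textblob import TextBlob

feedback = [
    "The new phone is fantastic, battery lasts all day.",
    "It arrived on time.",
    "Terrible customer service, I want a refund.",
]

for text in feedback:
    polarity = TextBlob(text).sentiment.polarity  # ranges from -1.0 to 1.0
    # The 0.1 thresholds are arbitrary illustrative cutoffs
    label = "positive" if polarity > 0.1 else "negative" if polarity < -0.1 else "neutral"
    print(f"{label:8} {polarity:+.2f}  {text}")
```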

In essence, text mining is more concerned with the discovery of hidden patterns or trends in text
data, while text analytics focuses on structuring unstructured text data for easier analysis or
decision-making. Both are subsets of the broader field of NLP and often use similar tools and
techniques, but they are applied for different purposes.
Hot take:
AI will not replace humans.
But humans who know how to use AI will replace humans who don't.

Subjective = feeling
Objective = facts

Humans matter as sensors: different people give different opinions from multiple
perspectives, providing a wider range of signals.

Why does text mining matter?


Eg. companies collect feedback from consumers; since the earlier generations of their
phones, battery sizes have grown.
They find out what matters to customers.
It brings in profits, and consumers stay loyal.

The term "space-delimited lines" refers to lines of text in which individual elements within each
line are separated by spaces. In this format, a space serves as the delimiter that distinguishes
one piece of data from another. This is commonly used in text files where each line represents a
record and the space separates different fields within that record.

### Example:
Here's a simple example of space-delimited lines in a hypothetical file containing names and
ages:

```
John 25
Emily 30
Alice 22
Bob 28
```

In this example, each line represents a record, and within each line, the name and age are
separated by a space.

### Usage:
Space-delimited files are often used for simple data storage or for transferring data between
programs that can parse text files. However, they are less common than other formats like CSV
(Comma-Separated Values) or TSV (Tab-Separated Values) because spaces can appear within
the data fields themselves, causing parsing issues.

### Parsing:
To parse a space-delimited file, you would typically read the file line by line and then split each
line by spaces to get the individual fields.

In Python, for example, you could use the following code to read a space-delimited file:

```python
with open('space_delimited_file.txt', 'r') as f:
    for line in f:
        name, age = line.strip().split(' ')
        print(f"Name: {name}, Age: {age}")
```

This would output:

```
Name: John, Age: 25
Name: Emily, Age: 30
Name: Alice, Age: 22
Name: Bob, Age: 28
```

Space-delimited lines are straightforward but should be used cautiously, especially if the data
fields can contain spaces themselves.
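
One common workaround, assuming only the last field is guaranteed to be space-free, is to split from the right on the final space; this is a sketch of that idea, not a general fix (the example record is invented).

```python
# Sketch: tolerate spaces inside the name field by splitting only on the last space.
line = "Mary Ann 27"
name, age = line.rsplit(' ', 1)
print(f"Name: {name}, Age: {age}")  # Name: Mary Ann, Age: 27
```
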
Topic mining: mining within data to organize it and add topics
Eg. mining and finding posts that are talking badly about someone (when it wasn't obvious)

Sentiment analysis: mining within data to find out the feeling of the writer
Eg. mining and finding the emotion of the writer when writing a post

NLP common issues:


Accents affect how a model interprets speech during data collection.

Natural Language Processing (NLP) models have made significant strides in recent years, but
they still face a range of challenges that limit their effectiveness and applicability. Here are some
common issues:

### Ambiguity
1. **Lexical Ambiguity**: A single word can have multiple meanings (e.g., "bat" could refer to an
animal or a piece of sports equipment).
2. **Syntactic Ambiguity**: Sentences can be interpreted in multiple ways due to their structure
(e.g., "I saw the man with the telescope").

### Context Understanding


1. **Sarcasm and Irony**: Detecting sarcasm or irony is difficult because it often relies on tone,
which is not present in text.
2. **Cultural Context**: Language is often tied to cultural norms and idioms that models may not
understand.

### Scalability and Efficiency


1. **Computational Costs**: Advanced models like transformers are resource-intensive, making
them less accessible for real-time applications or low-resource environments.
2. **Data Requirements**: Many models require large, labeled datasets for training, which may
not be available for all languages or domains.

### Ethical and Bias Concerns


1. **Data Bias**: If the training data contains biases, the model will likely perpetuate those
biases.
2. **Fairness**: Models may perform differently for different demographic groups, raising
concerns about fairness and representation.

### Interpretability and Explainability


1. **Black Box Models**: Many advanced models are complex and not easily interpretable,
making it hard to understand their decisions.
2. **Lack of Justification**: NLP models often provide an output without explaining the reasoning
behind it, which is a critical issue in sensitive applications like healthcare or legal decisions.

### Language Coverage


1. **Low-Resource Languages**: Many NLP tools are optimized for English or other
high-resource languages, leaving low-resource languages underserved.
2. **Code-Switching**: People often switch between languages in a single conversation, which
can confuse NLP models.

### Domain-Specific Challenges


1. **Technical Jargon**: Domains like medicine or law use specialized language that general
NLP models may not understand.
2. **Noise in Text**: User-generated content like social media posts often contains typos, slang,
and abbreviations that can affect model performance.

### Real-World Adaptability


1. **Generalization**: Models trained on one type of data may not perform well on slightly
different data.
2. **Robustness**: Models can be sensitive to slight changes in input phrasing or format.

Addressing these challenges often involves a combination of improved algorithms, better
training data, and domain-specific adaptations. Advances in areas like transfer learning,
explainable AI, and ethical AI are also helping to mitigate some of these issues.
