Project Topic Metadata

Project Topic Advisor – Topic Standardization initiative
Problem statement – The defect identification process involves identifying keywords and grouping them
together to form an actionable topic/defect, which is then reported to the person who acts on the
defect. During this process, the name given to a topic/defect depends entirely on the person working on
it. He/she may name the same topic the same way or differently each time the topic is created. Moreover,
the more people work on this task, the greater the variation in naming the same defect/topic. This not
only looks inconsistent but also affects the reporting of the defects. Below is an example of the
same:
S. No.  Review Id  Defect/Topic
1.      Xxx        Shipment and delivery
2.      Yyy        Packaging
3.      Zzz        Shipment & Delivery
4.      Aaa        Shipment/Delivery
5.      Bbb        Shipment and delivery
6.      Ccc        Packaging Issues
7.      ddd        Shipment and delivery issues
8.      eee        Shipment & Delivery
9.      fff        Pricing
10.     ggg        Shipment and delivery issue
11.     hhh        Packaging issues
12.     iii        Smell
13.     jjj        Odor
If we are to summarize and make decisions based on the above defect data, it would look something like
this (until we normalize/standardize the topics):
Defect/Topic                  Count
Shipment and delivery         2
Packaging                     1
Shipment & Delivery           2
Shipment/Delivery             1
Packaging Issues              2
Shipment and delivery issues  1
Pricing                       1
Shipment and delivery issue   1
Smell                         1
Odor                          1
The problem with the above approach to aggregation is that the data is miscounted because of the
following factors:
1. General Variation – General variation is a variation of a base term caused by differences in
tense, word form or even spelling mistakes. The data above is misleading because the total count
for ‘Shipment and delivery’ is actually 7 (2 + 2 + 1 + 1 + 1 across its five spellings). The system
cannot automatically arrive at the correct count because of these variations of the term
‘Shipment and delivery’.
2. Lexicon Based Variation – ‘Smell’ and ‘Odor’ are 2 topics which mean essentially the same thing
but are represented using different words.
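The miscount can be reproduced with a few lines of Python. This is only an illustrative check on the example rows above; the `startswith("shipment")` test is a hypothetical stand-in for a person recognising that the spellings refer to the same topic.

```python
from collections import Counter

# Defect/topic labels copied from the example table, in row order.
labels = [
    "Shipment and delivery", "Packaging", "Shipment & Delivery",
    "Shipment/Delivery", "Shipment and delivery", "Packaging Issues",
    "Shipment and delivery issues", "Shipment & Delivery", "Pricing",
    "Shipment and delivery issue", "Packaging issues", "Smell", "Odor",
]

# Case-insensitive exact-string counting, which is what the summary table does.
raw_counts = Counter(label.lower() for label in labels)

# Hypothetical manual check: treat every label starting with "shipment"
# as the same underlying topic and add up its fragments.
shipment_total = sum(count for label, count in raw_counts.items()
                     if label.startswith("shipment"))

print(len(raw_counts))                      # distinct labels in the summary
print(raw_counts["shipment and delivery"])  # count reported for one spelling
print(shipment_total)                       # count once the spellings are merged
```

For the 13 example rows this prints 10, 2 and 7: ten distinct labels, a reported count of 2 for one spelling, and 7 once all ‘Shipment and delivery’ spellings are merged.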
This may look like a simple problem to solve if someone manually standardizes or normalizes the
repeating defects every time defects need to be reported. That may be feasible when the data set is
small. However, the 4Star team has created 500+ defects/topics over a period of 1 year, and
this number keeps increasing as we scale up to add new product lines and
marketplaces.
Proposed Solution
This document serves as a proposal to address the problem statement above. The solution is
divided into two parts:
1. Corrective Plan – To standardize the nomenclature of the existing topics/defects.
2. Preventive Plan – To devise a solution which avoids the variations created during the naming process.
Levenshtein distance – String Matching Algorithm
Levenshtein distance is a string metric for measuring the difference between two sequences. Informally,
the Levenshtein distance between two words is the minimum number of single-character edits
(insertions, deletions or substitutions) required to change one word into the other. This will help us
group the topics that arise from general variations, such as ‘Shipment and Delivery’, ‘Shipment &
Delivery’ and ‘Shipment and delivery issues’.
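The definition above can be sketched in Python as follows. The `levenshtein` function is the standard dynamic-programming formulation; the `similarity` score (1 minus distance divided by the longer length, compared case-insensitively) is an assumption made here for illustration, since the proposal does not specify how distances are turned into match percentages.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score in [0, 1]: 1 - distance / longer length, case-insensitive."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


base = "Shipment and delivery"
for other in ["Shipment & Delivery", "Shipment and delivery issues", "Pricing"]:
    print(other, levenshtein(base.lower(), other.lower()),
          round(similarity(base, other), 2))
```

Ignoring case, ‘Shipment & Delivery’ is 3 edits away from ‘Shipment and delivery’ and ‘Shipment and delivery issues’ is 7 edits away, while an unrelated topic such as ‘Pricing’ scores far lower.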
WordNet – WordNet is a lexical database for the English language and is available as a corpus through
NLTK. It can be used alongside the NLTK module to find the meanings of words, synonyms, antonyms, and
more. When it is used together with Levenshtein distance, WordNet generates variations of the
‘Topic Name’ entered by the user based on its synonyms, and each variation is passed through the
Levenshtein distance string matching algorithm to find the closest match for the input ‘Topic Name’.
Example: User input = ‘Smell’
WordNet will create variations of the input word which may look like this:
[‘Smell’, ‘Odor’, ‘Aroma’, ‘Whiff’, ‘Scent’, ‘Fragrance’, ‘Stink’]
It will then check for the closest matching topic in the database of existing topics, which could look
like this:
Existing Topics        Match
Taste                  0%
Odor                   100%
Shipment and delivery  0%
Damage                 0%
Defective              0%
Flavor                 0%
Packaging              0%
Based on the best match, the tool suggests a topic, i.e. ‘Odor’.
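A minimal sketch of this lookup is given below. Several things in it are assumptions rather than part of the proposal: the synonym list is a hand-written stand-in (`HYPOTHETICAL_SYNONYMS`) so that the example runs without NLTK, the helper names are illustrative, and the score is an assumed normalization of Levenshtein distance, so non-matching topics receive small nonzero scores rather than exactly 0%. The `wordnet_synonyms` function shows how the same list could be drawn from NLTK's WordNet reader when the `nltk` package and its `wordnet` corpus are installed.

```python
def levenshtein(a, b):
    """Edit distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score in [0, 1]: 1 - distance / longer length, case-insensitive."""
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


# Hand-written stand-in for WordNet output (hypothetical, mirrors the example above).
HYPOTHETICAL_SYNONYMS = {
    "smell": ["Smell", "Odor", "Aroma", "Whiff", "Scent", "Fragrance", "Stink"],
}


def stub_synonyms(word):
    """Look the word up in the stand-in table; fall back to the word itself."""
    return HYPOTHETICAL_SYNONYMS.get(word.lower(), [word])


def wordnet_synonyms(word):
    """Real lookup; requires the nltk package and its 'wordnet' corpus."""
    from nltk.corpus import wordnet
    names = {word}
    for synset in wordnet.synsets(word):
        names.update(name.replace("_", " ") for name in synset.lemma_names())
    return sorted(names)


def score_topics(user_input, existing_topics, synonyms_of=stub_synonyms):
    """Score each existing topic by its best similarity to any variation of the input."""
    variations = set(synonyms_of(user_input)) | {user_input}
    return {topic: max(similarity(v, topic) for v in variations)
            for topic in existing_topics}


existing = ["Taste", "Odor", "Shipment and delivery", "Damage",
            "Defective", "Flavor", "Packaging"]
scores = score_topics("Smell", existing)
suggestion = max(scores, key=scores.get)
print(suggestion, scores[suggestion])
```

With the stand-in synonym list, ‘Odor’ is the only existing topic that reaches a full score, so it becomes the suggestion. Passing `synonyms_of=wordnet_synonyms` swaps in the real WordNet lookup.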
Corrective Plan
Below are the steps involved in correcting the unwanted variations:
1. Identifying all the topics/defects available to us today for all product lines across
marketplaces.
2. Passing the defects through the Levenshtein distance and WordNet string matching algorithms and
grouping the similar-looking defects.
3. Combining and cleaning the keywords available in all the topics of the same group, which will be
part of the new ‘Standardized Topic’. The cleaning task would be to remove all the unwanted
and recurring keywords.
4. Creating a new ‘Standardized Topic’ in the affected workspaces.
5. Annotation, precision measurement and backfilling.
6. Making the necessary modifications in the ‘Defect threshold mapping’ sheet, which will be fed to the
defect alerting mechanism.
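The grouping in step 2 could be sketched as a simple greedy pass over the existing topic names. Everything specific here is an assumption for illustration: the greedy strategy, the `group_topics` name, the 0.55 threshold (picked so that the small example groups sensibly; a real threshold would need tuning and human review), and the use of the first-seen label as the provisional standardized name. Only the Levenshtein part is shown, so lexicon-based variations such as ‘Smell’ and ‘Odor’ stay separate until the WordNet pass.

```python
def levenshtein(a, b):
    """Edit distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score in [0, 1]: 1 - distance / longer length, case-insensitive."""
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def group_topics(labels, threshold=0.55):
    """Greedy grouping: a label joins the first group whose first member is similar enough."""
    groups = []
    for label in labels:
        for group in groups:
            if similarity(label, group[0]) >= threshold:
                group.append(label)
                break
        else:  # no existing group was close enough
            groups.append([label])
    return groups


labels = [
    "Shipment and delivery", "Packaging", "Shipment & Delivery",
    "Shipment/Delivery", "Shipment and delivery", "Packaging Issues",
    "Shipment and delivery issues", "Shipment & Delivery", "Pricing",
    "Shipment and delivery issue", "Packaging issues", "Smell", "Odor",
]

groups = group_topics(labels)
# Provisional standardized name = first label seen in each group (a placeholder
# for the name a reviewer would actually choose).
standardized_counts = {group[0]: len(group) for group in groups}
for name, count in standardized_counts.items():
    print(name, count)
```

On the example data this yields ‘Shipment and delivery’ with 7 records and ‘Packaging’ with 3, with ‘Pricing’, ‘Smell’ and ‘Odor’ at 1 each.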
Preventive Plan
Below are the steps involved in preventing the variations from occurring:
1. Create a topic suggestion solution that integrates string-based and vocabulary/dictionary-based
string matching algorithms into a UI. The user is asked to enter a topic; based on all the topics
currently available in textminer, the UI suggests the closest match, and the user decides, based
on the actuals, whether or not to use the suggested topic.
2. Create a playbook which guides users on how to perform defect identification efficiently, and
include the UI guide in it.
Topic Advisor – User Interface logic
1. Start: the user inputs a ‘Topic Name’.
2. Does the topic exist in the ‘Standardized Topic’ database?
   Yes – Suggest the existing topic that matched with the input. If the suggested topic satisfies the
   user’s needs, the user uses the suggested topic. Exit.
   No – Go to step 3.
3. Use the Levenshtein string matching to find the closest match from the ‘Standardized Topic’
   database. Does a match exist in the database?
   Yes – Suggest the existing topic, as in step 2.
   No – Go to step 4.
4. Use the WordNet string matching to find the closest match from the ‘Standardized Topic’
   database. Does a match exist in the database?
   Yes – Suggest the existing topic, as in step 2.
   No – Go to step 5.
5. The user will create a new custom topic.
6. Stop/Exit.
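The lookup order in this flow (exact match, then Levenshtein, then WordNet, then a new custom topic) could be sketched as one function. This is an illustration under stated assumptions, not the tool's implementation: `suggest_topic`, `closest`, the 0.75 threshold and the small synonym table are all hypothetical, and the user's accept/reject decision is left to the caller because the flow does not say what happens when a suggestion is rejected.

```python
def levenshtein(a, b):
    """Edit distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def closest(name, topics, threshold):
    """Best topic by normalized Levenshtein similarity, or None if below the threshold."""
    def score(topic):
        a, b = name.lower(), topic.lower()
        return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    best = max(topics, key=score, default=None)
    return best if best is not None and score(best) >= threshold else None


def suggest_topic(name, standardized, synonyms_of, threshold=0.75):
    """Return (stage, suggestion) in the order exact -> Levenshtein -> WordNet -> new."""
    by_lower = {topic.lower(): topic for topic in standardized}
    if name.lower() in by_lower:                 # topic already exists
        return "exact", by_lower[name.lower()]
    hit = closest(name, standardized, threshold)
    if hit is not None:                          # general variation
        return "levenshtein", hit
    for variant in synonyms_of(name):            # lexicon-based variation
        hit = closest(variant, standardized, threshold)
        if hit is not None:
            return "wordnet", hit
    return "new", name                           # user creates a custom topic


# Hypothetical data: a tiny standardized list and a stand-in for WordNet synonyms.
standardized = ["Shipment and delivery", "Packaging", "Pricing", "Odor"]
synonym_table = {"smell": ["odor", "aroma", "scent"]}


def lookup(word):
    return synonym_table.get(word.lower(), [])


for entry in ["packaging", "Shipment & Delivery", "Smell", "Leakage"]:
    print(entry, suggest_topic(entry, standardized, lookup))
```

Returning the stage alongside the suggestion lets the UI tell the user why a topic was proposed, which may help them decide whether it fits.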
Risks to consider
1. Negative effects in the backend table (Textminer_tags) – It is unclear how renaming an existing
topic affects the data in the DW table. Ideally, renaming a topic would automatically trigger an
instant 18-month backfill by textminer. Old records/old topics would be made inactive, and new
topics/new records would become available. To confirm this, an experiment is needed in which a
few sample topics are renamed. Snapshots will be taken before and after the renaming in order to
assess the behavior.
2. Inconsistencies in the defect alerting dashboard data – Since we are renaming topics, the
defect alerting process should be modified to account for this activity. We need to collaborate
with the CPB team and the respective RAs involved in revamping the dashboard to determine how to
fix this inconsistency.
3. Backfilling of the DW table with new standardized topic info – We are unsure how a bulk backfill
will perform on this activity. Bulk backfills have triggered data inconsistencies in the past.
A backup of at least 6 months should be created before performing a bulk backfilling job.
4. Awareness, propagation and user adoption – A well-thought-out plan is needed to help users
adopt this tool and to ensure compliance. Collaborate with the quality team to ensure
compliance in this matter. In the end, no topic must be created without consulting this tool.
