IMDB Scraping & Analysis

Interview Notes (Project Related)
1. IMDB Scraping & Analysis

a. The difference between span and div is that a span element is inline and usually used for
a small chunk of HTML inside a line (such as inside a paragraph), whereas
a div (division) element is block-level (which is basically equivalent to having a line break
before and after it) and used to group larger chunks of content.
b. Difference between lxml and html.parser:
html.parser - BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, decent speed, lenient (as of Python 2.7.3 and
3.2.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
lxml - BeautifulSoup(markup, "lxml")
Advantages: Very fast, lenient
Disadvantages: External C dependency
c. Parsed 5 pages per year.
d. Basic HTML, CSS and JavaScript knowledge required.
e. From the IMDB histogram, we can see that most ratings are between 6 and 8. There are
few movies with a rating greater than 8, and even fewer with a rating lower than 4.
This indicates that both very good movies and very bad movies are rarer.
f. The distribution of Metascore ratings resembles a normal distribution: most ratings are
average, peaking at a value of approximately 50. According to this distribution, there
are indeed fewer very good and very bad movies, but not as few as the IMDB ratings
indicate.
g. Created a pie chart and histogram.
h. The get() function from the requests library requests the content of the page from the server.
i. BeautifulSoup parses the requested HTML document so that useful data can be extracted.
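The span/div and parsing notes above can be illustrated with a minimal sketch. It assumes a made-up HTML snippet (the class names movie, title and rating are hypothetical, not IMDB's real markup) and uses the standard library's html.parser module, which is the parser that BeautifulSoup's "html.parser" option builds on, so it runs without a network request or external packages. The MovieParser class and parse_movies helper are illustrative names, not part of the original project.

```python
from html.parser import HTMLParser

# Hypothetical markup: each movie sits in a block-level <div>,
# with inline <span> elements for the title and the rating.
SAMPLE_HTML = """
<div class="movie"><span class="title">Movie A</span> <span class="rating">7.4</span></div>
<div class="movie"><span class="title">Movie B</span> <span class="rating">6.1</span></div>
"""


class MovieParser(HTMLParser):
    """Collects the text of <span class="title"> and <span class="rating">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # class of the span being read, if any
        self.movies = []           # one dict per <div class="movie">

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "movie":
            self.movies.append({})
        elif tag == "span" and css_class in ("title", "rating") and self.movies:
            self.current_field = css_class

    def handle_data(self, data):
        # Only keep text that appears inside a span we care about.
        if self.current_field is not None:
            self.movies[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def parse_movies(html):
    """Return a list of (title, rating) tuples found in the markup."""
    parser = MovieParser()
    parser.feed(html)
    return [(m["title"], float(m["rating"])) for m in parser.movies]


movies = parse_movies(SAMPLE_HTML)
```

With requests and BeautifulSoup installed, the same idea would typically be written as BeautifulSoup(requests.get(url).text, "html.parser") followed by find_all calls on the container elements.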
2. DataCan: Genome Cancer Data Analysis

1. Submitted a research paper.
2. 68.36% accuracy achieved with RandomForestClassifier using GridSearchCV and 10-fold
cross-validation.
3. 61% accuracy achieved with SVC using a pipeline of custom transformers (built on BaseEstimator and
TransformerMixin) to extract features from text and CSV data. GridSearchCV and 10-fold
cross-validation didn't improve the accuracy.
4. Defined 9 classes based on the genes present in each class.
5. Future work includes using deep learning models such as LSTM for classification
and better preprocessing of text data.
6. Accuracy was low because genes and their mutations didn’t follow the correct nomenclature,
so a lot of data was lost during conversion of text to numerical form.
7. NLTK was used to extract important genetic words from text by removing stopwords and
special characters; the words were then one-hot encoded to convert them to numerical values.
8. Classified mutations into their respective 17 types using regular expressions.
9. Faced difficulty in handling text data, as proper nomenclature was not followed, and also
in defining classes. Some research work was required to understand mutations and
their types.
10. A confusion matrix was plotted for better representation and understanding of the model’s
accuracy for each class. Classes 8 and 9 had very little data and received higher accuracy,
while some classes with more data, such as 2, 3 and 5, achieved low accuracy.
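The text preprocessing described in item 7 can be sketched roughly as follows. This is a stand-in, not the project's code: it uses a small hard-coded stopword set instead of NLTK's stopword corpus, the sample sentences are invented, and the helper names tokenize, build_vocabulary and one_hot are hypothetical. Each document becomes a 0/1 vector marking which vocabulary words it contains.

```python
import re

# Tiny hand-written stopword set; the project used NLTK's list instead.
STOPWORDS = {"the", "a", "an", "in", "of", "is", "was", "and", "with", "to"}


def tokenize(text):
    """Lowercase, strip special characters, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(documents):
    """Map every distinct token to a column index (sorted for stable order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary tokens occur in text."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] = 1
    return vector


# Invented example sentences, for illustration only.
docs = [
    "The BRAF V600E mutation was found in the tumour.",
    "A deletion in EGFR is associated with the tumour!",
]
vocab = build_vocabulary(docs)
vectors = [one_hot(d, vocab) for d in docs]
```

Tokens that do not match the expected nomenclature never line up with a vocabulary column, which is one way the data loss mentioned in item 6 can arise.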
3. Image Predictor using ResNet Model

a) Flask-based web app.
b) Predicts the class of the input image given by the user.
c) Re-trained the ResNet50 model from the keras.applications library, starting from pre-trained weights.
Loads the image and resizes it to the input size required by the model, i.e. 224 x 224.
d) Saved the trained model's architecture to a JSON file and its weights to a separate file.
e) Loss used was categorical_crossentropy, optimizer was Adam, and the metric was accuracy.
f) Achieved an accuracy of 99%.
g) ResNet50 was used because it addresses the degradation problem, where accuracy drops as plain
networks get deeper. It is a 50-layer residual network which uses shortcut connections (directly
connecting the input of the nth layer to some (n+x)th layer), so more layers can be trained without accuracy degrading.
h) categorical_crossentropy was used because it is the standard loss for multiclass classification with one-hot labels.
i) A JSON file is easy to save and load.
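As a worked illustration of the loss named above, the sketch below computes categorical cross-entropy for a single sample by hand. The probabilities are invented, and the function is a plain-Python illustration of the formula, not Keras's implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample: -sum(t * log(p)) over the classes.

    y_true is a one-hot label, y_pred a probability distribution
    (for example a softmax output). eps guards against log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


label = [0, 1, 0]                 # true class is index 1
confident = [0.05, 0.90, 0.05]    # most probability on the right class
unsure = [0.40, 0.30, 0.30]       # little probability on the right class

loss_confident = categorical_crossentropy(label, confident)
loss_unsure = categorical_crossentropy(label, unsure)
```

Because the label is one-hot, only the predicted probability of the true class contributes, so the loss shrinks as the model puts more probability on the correct class.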
4. Market Forecast of newly launched products

• The live machine learning project I worked on was 'Improving Market Forecast for newly
launched products'. The problem statement was to predict the sales of any product newly
launched in the market, taking into consideration some market factors, price and
an innovation factor. Since there was no data for the new product, a dataset of existing
laptops with their specifications, price and sales was used. The dataset obtained had
some features with no significance and some null values. The columns with no significance
were removed and the null values in the sales column were filled using the mean value. This dataset
was used for further training of our models.
• Our project is a Flask-based web app, where the client enters the specifications of their
product. These new specifications are stored with the existing dataset, and K-Means
clustering is used to find the existing laptops with similar specifications. Multiple linear
regression is then used on this cluster to predict the sales of the new product. Then, we designed an
innovation potential which incorporates various market conditions such as 'Is the product
economical?', 'Is it useful in current scenarios?', 'Does usage of the technology increase social
prestige?' and 'How new is this technology in the company?'. The calculated
potential was normalized to the range 0.75 to 1.25 and multiplied by the predicted sales to find the final
predicted sales. If the potential is below 1, the predicted sales decrease, and if it is above 1, they
increase.
• We used a combination of K-Means and multiple linear regression as it produced the best
score on the training set. Other regression algorithms such as Support Vector Regressor,
Decision Trees and Random Forest Regressor were also tried but achieved a lower score than
our proposed solution.
• Future scope includes incorporating other market factors such as festival season and
location-based prediction, and using a Restricted Boltzmann Machine for better accuracy.
• Faced problems in finding an appropriate dataset for the problem, integrating the models with
Flask, and designing an innovation potential that incorporates market factors and price.
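The innovation-potential adjustment described above can be sketched as a min-max scaling into the range 0.75 to 1.25 followed by a multiplication. How the raw potential was actually scored is not stated in these notes, so the sketch assumes a raw score counting "yes" answers to a fixed number of market questions; the function names and the numbers are hypothetical.

```python
def scale_potential(raw_score, raw_min, raw_max, low=0.75, high=1.25):
    """Min-max scale a raw innovation score into the range [low, high]."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must be greater than raw_min")
    fraction = (raw_score - raw_min) / (raw_max - raw_min)
    return low + fraction * (high - low)


def adjust_sales(predicted_sales, potential):
    """Multiply the regression forecast by the innovation potential."""
    return predicted_sales * potential


# Hypothetical example: 5 yes/no market questions, 4 answered "yes",
# and a regression forecast of 1000 units.
potential = scale_potential(raw_score=4, raw_min=0, raw_max=5)
final_sales = adjust_sales(predicted_sales=1000, potential=potential)
```

A score at the midpoint of the raw range maps to a potential of exactly 1 and leaves the forecast unchanged; lower scores shrink it and higher scores raise it.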
Q1. Why interested in Data Science and Machine Learning?

I started in this field with a summer course on Big Data Hadoop a year ago. A large
amount of data is being generated every minute today, and I was amazed to see the numerous
applications of data science in using that data to find insights and solve various problems in
almost all domains, including medicine, finance, crime, agriculture, social media and marketing.
Then I opted for Python as one of my electives, where I was introduced to various Python libraries
commonly used in data science, such as NumPy, pandas and Matplotlib. Then I started
a MOOC on Machine Learning.
Q2. Experience in ML and AI?

I have done different domain-specific projects in data science, ML and AI.
Q3. Various other uses of ML and AI?

• Chatbots, used by Zomato for customer feedback and queries
• E-commerce: predicting potential customers
• Healthcare: diagnosing cancer and various other diseases faster than any human can
• Google Maps or Uber (commuting): estimating arrival time and optimized paths
• Emails (spam filtering)
• Market forecasting of newly launched products
• Plagiarism checkers for research papers
• Fraud prevention: detecting a fraud before it has happened
• Recommendation engines (retail): various e-commerce websites providing a personalized
experience
• Gaming: controlling the whole environment of the game with respect to the choices made by
the player
• Google Assistant: converting speech to text to numerical form for machine understanding
• Self-driving cars: detecting the nearby environment and learning the best way to deal with it
• Sophia, the humanoid robot
Q4. Tell me something about yourself?

Thank you, Sir/Ma’am, for this opportunity. I am from Delhi and belong to a nuclear
family. My father is a businessman and my mother is a housewife. My younger brother will
appear for his 12th board exams this year. We are a very close, loving, caring and sharing family. On an
individual front, I perceive myself as a confident, conscientious and hardworking individual
who is never afraid of challenges. I carry out any task assigned to me without hesitation,
provided the instructions are clear. In case of doubts, I never hesitate to put forth my
questions. I have always been a fast learner. My greatest weakness is that I don't like getting
interrupted when I am seriously into something. Another one of my weaknesses is that I trust
people very easily. I have a keen interest in Data Science and its related fields, Machine Learning and
Artificial Intelligence. My hobbies include gaming and eating, as you might have
already noticed. I am good at socializing with others, even in new surroundings. I am currently
pursuing a B.Tech. in IT at Manipal University Jaipur. I did my schooling in Delhi at Amity
International School, Pushp Vihar branch.
Q5. Do you have any questions?

What projects might I get if selected? I would like to learn about you; can you tell me
something about yourself? I would also like to ask about the merger between Duck Creek Technologies and
Mindtree.
Q6. Do you aspire to pursue higher studies?
As soon as I graduate, I want to start working in the real world. I have completed 4 different
domain-specific projects, which gave me some exposure to real-life problems.
Q7. Tell me about your greatest fear.

It might seem to you that I have never worked in my life, and that inexperience is my
weakness, but I beg to differ. I have always been a fast learner with a lot of confidence and an
open-minded approach to solving problems. I assure you that, if given a chance, I can prove my
smart-working nature.
Q8. Would you like to relocate?

Yes, I am comfortable with relocating. It helps an individual to work and adapt in
different work environments and conditions. One gets to meet different people and learn from
them.
Mindtree
1. An IT services company delivering digital transformation and technology services.
2. Founded in 1999, with headquarters in Bangalore.
3. Founded by Subroto Bagchi, Krishnakumar Natarajan, Scott Staples and Ashok Soota.
Current CEO is Rostow Ravanan, since 2016.
4. Mindtree is an IT services and outsourcing company that works on e-commerce,
mobility, cloud enablement, digital transformation, business intelligence, data analytics,
testing, infrastructure, EAI and ERP solutions.
5. Their slogan “Welcome to possible” reflects the ideology behind their work culture.
6. Chennai, Bangalore, Hyderabad, Pune
7. Flooresense is Mindtree’s intelligent real-time recommendation platform. It guides the
appropriate store associate to a shopper who is likely to become a customer if given the
necessary assistance.
8. There are a lot of career growth opportunities because Mindtree works on the latest
technologies and projects, where I would get hands-on experience. There are role-based
certifications for various profiles and also work-life balance. It is a global company with
various career opportunities in the USA, which can help me work across different cultures. It
is a data-driven company with deep roots in data science, providing several data
analytics services such as BI, Big Data, frameworks and data visualization. In 2016, it was
ranked in the top 3 for Best CEO and Analytics Days and was mentioned in India’s Super
50.
Merkle Sokrati
1. Merkle Sokrati is a leading digital, technology and big data analytics firm, heavily
investing in automation and artificial intelligence.
2. An ad technology and analytics company with a focus on paid marketing channels.
3. Founded in March 2009 by Ashish Mehta, Anubhav Sonthalia and Santosh
Gannavarapu.
4. Great work culture.
5. Provides solutions in e-commerce, telecom, entertainment, travel and marketing.
6. Ranked in Deloitte’s Technology Fast 50 four times in a row.
7. A start-up which can provide me with immense opportunities in my area of interest. I will
get the right exposure and work experience which will help me a lot in starting my career.
