Below are some useful Machine Learning questions and answers.

Which ONE of the following is a regression task?

**A) Predict the age of a person**

B) Predict the country a person comes from

C) Predict whether the price of petroleum will increase tomorrow

D) Predict whether a document is related to science

**Answer: A**

Which of the following are supervised learning problems? (Mark all that apply)

A) Grouping people in a social network.

**B) Predicting credit approval based on historical data**

**C) Predicting rainfall based on historical data**

D) all of the above

**Answer: B and C**

Which of the following are classification tasks? (Mark all that apply)

**A) Find the gender of a person by analyzing their writing style**

B) Predict the price of a house based on floor area, number of rooms, etc.

**C) Predict whether there will be abnormally heavy rainfall next year**

D) Predict the number of copies of a book that will be sold this month

**Answer: A, C**

Which of these are categorical features?

A) The height of a person

B) Price of petroleum

**C) Mother tongue of a person**

D) Amount of rainfall in a day

**Answer: C**

Occam’s razor is an example of

**A) Inductive bias**

B) Preference bias

**Answer: A**

How does generalization performance change with increasing size of the training set?

**A) Improves**

B) Deteriorates

C) No Change

D) None

**Answer: A**

In regression the output is

A) Discrete.

B) Continuous and always lies in a finite range.

**C) Continuous.**

D) Maybe discrete or continuous.

**Answer: C**

In linear regression the parameters are

A) strictly integers

B) always lie in the range [0,1]

**C) any value in the real space**

D) any value in the complex space

**Answer: C**

Which of the following is true for a decision tree?

A) A decision tree is an example of a linear classifier.

**B) The entropy of a node typically decreases as we go down a decision tree.**

C) Entropy is a measure of purity.

D) An attribute with lower mutual information should be preferred to other attributes.

**Answer: B**

Given a dataset of 14 examples, 9 positive and 5 negative, the entropy of the dataset with respect to this classification is

**A) 0.940**

B) 0.06

C) 0.50

D) 0.22

**Answer: A**
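The answer can be verified directly: the entropy of a 9-positive/5-negative split is −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940. A minimal sketch in Python (the `entropy` helper is illustrative, not from any particular library):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))  # 0.94
```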

Decision trees can be used for the following type of datasets:

I. The attributes are categorical

II. The attributes are numeric valued and continuous

III. The attributes are discrete-valued numbers

A) In case I only

B) In case of II only

C) In cases II and III only

**D) In cases I, II and III**

**Answer: D**

One of the most common uses of Machine Learning today is in the domain of Robotics. Robotic tasks include a multitude of ML methods tailored towards navigation, robotic control and a number of other tasks. Robotic control includes controlling the actuators available to the robotic system. An example of this is the control of a painting arm in automotive industries. The robotic arm must be able to paint every corner in the automotive parts while minimizing the quantity of paint wasted in the process. Which of the following learning paradigms would you select for training such a robotic arm?

A) Supervised learning

B) Unsupervised learning

C) Combination of supervised and unsupervised learning

**D) Reinforcement learning**

**Answer: D**

In the K-NN algorithm, given a set of training examples and a value of k smaller than the size of the training set, the algorithm predicts the class of a test example to be the

**A) Most frequent class among the classes of closest training examples.**

B) Least frequent class among the classes of closest training examples.

C) Class of the closest point.

D) Most frequent class among the classes of the farthest training examples.

**Answer: A**
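A minimal sketch of this prediction rule, assuming Euclidean distance and a small hand-made dataset (the `knn_predict` helper and the example points are illustrative):

```python
from collections import Counter

def knn_predict(train, test_point, k):
    """Predict the most frequent class among the k closest training examples."""
    # train: list of (feature_vector, label) pairs
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda ex: dist(ex[0], test_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "red"), ((1, 0), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_predict(train, (5.5, 5.5), 3))  # blue
```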

In Collaborative Filtering based Recommendation, the items are recommended based on which of the following?

**A) Similar users**

B) Similar items

C) Both A and B

D) None

**Answer: A**

Which of the following are advantages of a large value of k in the K-NN algorithm?

A) Less sensitive to noise.

B) Better probability estimates for discrete classes.

C) Larger training sets allow larger values of k.

**D) All of the above.**

**Answer: D**

For which of the following may dimensionality reduction be used?

A) Data Compression

B) Data Visualization

C) To prevent overfitting

**D) Both A and B**

**Answer: D**

Which of the following is a limitation of Collaborative Filtering?

A) Over specialization

**B) Cold start**

C) Both A and B

D) None

**Answer: B**

Which of the following statements are true about PCA?

(i) We must standardize the data before applying PCA.

(ii) We should select the principal components which explain the highest variance

(iii) We should select the principal components which explain the lowest variance

(iv) We can use PCA for visualizing the data in lower dimensions

**A. (i), (ii) and (iv)**

B. (ii) and (iv)

C. (iii) and (iv)

D. (i) and (iii)

**Answer: A**
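Statements (i), (ii) and (iv) can be illustrated with a short NumPy sketch: standardize the data, eigendecompose the covariance matrix, and keep the highest-variance components for a 2-D visualization (the synthetic data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)  # make one dimension nearly redundant

# (i) standardize, then eigendecompose the covariance matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))  # ascending order

# (ii) keep the components explaining the highest variance
order = np.argsort(eigvals)[::-1]

# (iv) project to 2-D for visualization
projected = Xs @ eigvecs[:, order[:2]]
print(projected.shape)  # (100, 2)
```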

In feature selection, which of the following techniques can be used to find a subset of features?

A) Sequential forward search

B) Sequential backward search

**C) Both A and B**

D) None of A or B

**Answer: C**

[True or False] If the Pearson correlation between two variables is zero, their values can still be related to each other.

**A) TRUE**

B) FALSE

**Answer: A**
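A classic illustration: with y = x² on symmetric x values, y is a deterministic function of x, yet the Pearson correlation is zero because the relationship is non-linear (a small sketch, assuming NumPy):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # perfectly determined by x, yet linearly uncorrelated with it
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-9)  # True
```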

Bayesian Network is a graphical model that efficiently encodes the joint probability distribution for a large set of variables.

**A)True**

B) False

**Answer: A**

A fair coin is tossed three times and a T (for tails) or H (for heads) is recorded, giving us a list of length 3. Let X be the random variable which is zero if no T has another T adjacent to it, and one otherwise. Let Y denote the random variable that counts the number of T’s in the three tosses.

Find P(X=1, Y=2).

A)1/8

**B) 2/8**

C) 5/8

D) 7/8

**Answer: B**
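The answer follows from enumerating the 8 equally likely outcomes: of the three outcomes with exactly two T's (TTH, THT, HTT), only TTH and HTT have adjacent T's. A quick check in Python:

```python
from itertools import product

count = 0
for toss in product("HT", repeat=3):
    y = toss.count("T")  # number of tails
    # X = 1 iff some T is adjacent to another T
    x = 1 if any(toss[i] == toss[i + 1] == "T" for i in range(2)) else 0
    if x == 1 and y == 2:
        count += 1
print(count, "/ 8")  # 2 / 8
```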

Two cards are drawn at random from a deck of 52 cards without replacement. What is the probability of drawing a 2 and an Ace in that order?

A) 4/51

B) 1/13

C) 4/256

**D) 4/663**

**Answer: D**
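The probability is P(2 first) · P(Ace second | 2 first) = (4/52) · (4/51) = 4/663, which exact-fraction arithmetic confirms:

```python
from fractions import Fraction

# P(first card is a 2) * P(second card is an Ace | first was a 2)
p = Fraction(4, 52) * Fraction(4, 51)
print(p)  # 4/663
```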

A and B alternately throw a pair of dice. A wins if he throws 6 before B throws 7, and B wins if she throws 7 before A throws 6. If A begins, his chance of winning would be:

**A) 30/61**

B) 31/61

C) 1/2

D) 6/7

**Answer: A**
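Writing p = P(sum is 6) = 5/36 and q = P(sum is 7) = 6/36, A wins immediately with probability p, or both players miss and the game restarts, giving P(A) = p + (1 − p)(1 − q) · P(A). Solving for P(A) and checking with exact fractions:

```python
from fractions import Fraction

p = Fraction(5, 36)  # P(sum = 6) on a pair of dice
q = Fraction(6, 36)  # P(sum = 7) on a pair of dice

# P(A) = p + (1 - p)(1 - q) P(A)  =>  P(A) = p / (1 - (1 - p)(1 - q))
P_A = p / (1 - (1 - p) * (1 - q))
print(P_A)  # 30/61
```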

Diabetic Retinopathy is a disease that affects 80% of people who have had diabetes for more than 10 years. 5% of the Indian population has been suffering from diabetes for more than 10 years. What is the joint probability of finding an Indian who has been suffering from diabetes for more than 10 years and also has Diabetic Retinopathy?

A) 0.024

**B) 0.040**

C) 0.076

D) 0.005

**Answer: B**
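The joint probability is the product P(retinopathy | diabetes > 10y) · P(diabetes > 10y) = 0.80 × 0.05:

```python
# joint = conditional probability * marginal probability
p_joint = 0.80 * 0.05
print(round(p_joint, 3))  # 0.04
```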

Which of the following is false about support vectors?

A) The support vectors are the subset of data points that determine the max-margin separator.

B) The Lagrangian multipliers corresponding to the support vectors are non-zero.

C) The support vectors are used to decide which side of the separator a test case is on.

**D) The max-margin separator is a non-linear combination of the support vectors.**

**Answer: D**

Consider a binary classification problem. Suppose I have trained a model on a linearly separable training set, and now I get a new labeled data point which is correctly classified by the model and far away from the decision boundary. If I now add this new point to my earlier training set and re-train, in which cases is the learnt decision boundary likely to change?

A) When my model is a perceptron.

**B) When my model is logistic regression.**

C) When my model is an SVM.

**D) When my model is Gaussian discriminant analysis.**

**Answer: B and D**

After training an SVM, we can discard all examples which are not support vectors and still classify new examples.

**A) TRUE**

B) FALSE

**Answer: A**

If g(z) is the sigmoid function, then its derivative with respect to z may be written in terms of g(z) as

**A) g(z)(1-g(z))**

B) g(z)(1+g(z))

C) -g(z)(1+g(z))

D) g(z)(g(z)-1)

**Answer: A**
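The identity g′(z) = g(z)(1 − g(z)) can be checked numerically against a central-difference derivative (a small Python sketch):

```python
import math

def g(z):
    """Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

# Verify g'(z) = g(z) * (1 - g(z)) at a few points
for z in (-2.0, 0.0, 1.5):
    numeric = (g(z + 1e-6) - g(z - 1e-6)) / 2e-6  # central difference
    analytic = g(z) * (1 - g(z))
    print(abs(numeric - analytic) < 1e-8)  # True for each z
```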

Which of the following are true when comparing ANNs and SVMs?

**A) ANN error surface has multiple local minima while SVM error surface has only one minimum**

**B) After training, an ANN might land on a different minimum each time, when initialized with random weights during each run.**

**C) In training, an ANN’s error surface is navigated using a gradient descent technique while an SVM’s error surface is navigated using convex optimization solvers.**

D) As shown for Perceptron, there are some classes of functions that cannot be learnt by an ANN. An SVM can learn a hyperplane for any kind of distribution.

**Answer: A, B, C**

Which of the following is not a kernel function?

A) K(xi, xj) = xi·xj

**B) K(xi, xj) = (1 − xi·xj)³**

C) K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))

D) K(xi, xj) = tanh(β0 xi·xj + β1)

**Answer: B**

Which of the following are true about the SMO algorithm? (Mark all that apply)

A) The SMO can efficiently solve the primal problem.

**B) The SMO can efficiently solve the dual problem**

**C) The SMO solves the optimization problem by coordinate ascent.**

D) The SMO solves the optimization problem by coordinate descent.

**Answer: B, C**

Which of the following is/are true about the Perceptron classifier?

**A) It can learn an OR function**

**B) It can learn an AND function**

**C) The obtained separating hyperplane depends on the order in which the points are presented in the training process.**

D) For a linearly separable problem, there exists some initialization of the weights which might lead to non-convergent cases.

**Answer: A, B, and C**
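A minimal perceptron sketch (an illustrative helper, not a library implementation) that learns the OR function, confirming option A:

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """Classic perceptron update: w += lr * (target - prediction) * x."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            w[0] += lr * (target - pred) * x1
            w[1] += lr * (target - pred) * x2
            b += lr * (target - pred)
    return w, b

OR_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(OR_data)
print([1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
       for (x1, x2), _ in OR_data])  # [0, 1, 1, 1]
```

Presenting the training points in a different order can yield a different (but still separating) hyperplane, which is what option C describes.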

The back-propagation learning algorithm applied to a two-layer neural network

A) always finds the globally optimal solution.

**B) finds a locally optimal solution which may be globally optimal.**

C) never finds the globally optimal solution.

D) finds a locally optimal solution which is never globally optimal.

**Answer: B**