An Information Theoretic Scoring Function in Belief Network
Abstract: We propose a novel measure of mutual information, named Integration to Segregation (I2S), which explains the relationship between two features. We investigate its nontrivial characteristics and compare its performance in terms of class imbalance measures. We show that I2S possesses characteristics useful for identifying the sink and its source (parent) in a conventional directed acyclic graph in structure learning techniques such as the Bayesian Belief Network. We indicate empirically that identifying a sink and its parent using conventional scoring functions is not impressive in maximizing the discriminant function, because these functions are unable to identify the best topology. I2S, however, is capable of significantly maximizing the discriminant function, with the potential of identifying the network topology in structure learning.
Received September 9, 2012; accepted March 21, 2013; published online February 26, 2014
ease of estimation. However, in most cases the dataset is not necessarily linear in nature. Our proposed measure I2S is applicable to linear as well as nonlinear data sets.

The aim of this study is to introduce a novel dependency measure which can be used in the capacity of a scoring function by learning algorithms in the context of classification, namely, for learning the BBN classifier. We make an attempt to empirically evaluate the merits of each score as compared to our proposed measure I2S by means of an experimental study over benchmark datasets. These scores include well renowned scoring functions such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), Bayesian Dirichlet (BDeu) with likelihood equivalence and a uniform joint distribution, Entropy and Minimum Description Length (MDL). We shall detail their mathematical descriptions in the next section. Recently, Carvalho et al. [7] introduced factorized Conditional Log-Likelihood (fCLL), which is an approximation of the conditional log-likelihood criterion. It is a non-parametric measure, which was claimed empirically to give significantly better results than other measures in the K2 and TAN algorithms. Such an impressive scoring function motivates us to analyze its performance in correctly judging the true topology, the result of which is shown in our result section.

The organization of the remaining paper is articulated as below: In section 2, some background of the proposed measure is discussed. In section 3, the mathematical theory of the proposed dependence measure is presented. We have devoted the last two sections to empirical validation of I2S, followed by detailed discussion.

2. Related Work

In the last two decades, a lot of work has been reported on improvements of BBN. Application of BBN in various domains of interest has been highlighted [4, 25]. Cheng et al. [8] categorized BBN classifiers into two groups. One is the scoring based method, for which various scoring criteria were introduced, such as the Bayesian scoring method [9, 15], the entropy based method [16] and the minimum description length method [23]. The other group focuses on analysis of dependence relationships among features under the Conditional Independence (CI) test. The algorithms described in [8, 22, 27] belong to this category. Meilă and Jaakkola [19] elaborated Bayesian structure learning in trees in polynomial time.

In structure learning, there are two dimensions for the application of feature dependence. The first criterion is selection of useful features under the assumption of maximal statistical dependence criteria [21]. The second category is the appropriate selection of features in the design of a markov blanket for a node as a potential candidate class [8]. In the first category, poor features can be eliminated; however, this does not guarantee a more suitable and accurate markov blanket for a class node. Statistical analysis before the application of a classifier on a dataset is recommended [26]. This motivates us to adopt the second category, in which careful selection of markov blanket nodes is more important.

The BIC score, originally introduced by Schwarz [21], was framed over a belief network with hidden features. Jensen et al. [17] discussed two important characteristics for a scoring function used in the belief network. The first characteristic is the ability of any score to balance the accuracy of a structure against structure complexity. The second characteristic is its computational tractability. BIC is believed to satisfy both of the mentioned characteristics. BIC is formally defined as:

BIC(S | D) = log2 P(D | θ̂_S, S) − (size(S)/2) log2(N)    (1)

Where θ̂ is an estimate of the maximum likelihood parameters for the underlying structure. Jensen et al. [17] discussed that, in the case of a complete database, BIC is reducible to a problem of frequency counting such as:

BIC(S | D) = Σ_{i=1}^{n} Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log2(N_ijk / N_ij) − (log2(N)/2) Σ_{i=1}^{n} q_i (r_i − 1)    (2)

Where N_ijk indicates the count of dataset cases with node X_i in its kth configuration and pa(X_i) in its jth configuration, q_i denotes the number of configurations over the parents of node X_i in space S, and r_i indicates the states of node X_i. Another scoring measure, which depends only on an equivalent sample size N′, is BDeu [6]. Carvalho et al. [7] decomposed it into the mathematical form:

BDeu(B, T) = log(P(B)) + Δ    (3)

Δ = Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log( Γ(N′/q_i) / Γ(N_ij + N′/q_i) ) + Σ_{k=1}^{r_i} log( Γ(N_ijk + N′/(r_i q_i)) / Γ(N′/(r_i q_i)) ) ]    (4)

MDL [18, 23] is mostly suitable for complex Bayesian networks. Its mathematical formulation is composed of the Log Likelihood (LL) as follows:

LL(B | T) = Σ_{i=1}^{n} Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log(N_ijk / N_ij)    (5)

The value of LL is used in obtaining the value of MDL as below:

MDL(B | T) = LL(B | T) − (1/2) log(N) |B|    (6)

|B| denotes the length of the network, which is achieved in terms of frequency calculation of a given feature's possible states and its parent's state combinations, as follows:

|B| = Σ_{i=1}^{n} (r_i − 1) q_i    (7)

The Akaike Information Criterion (AIC), originally defined by Akaike [5], is defined mathematically as:

AIC = −2 × ln(likelihood) + 2 × K    (8)

Where K denotes the number of parameters in the given model. However, in belief network applications, its mathematical equation has been transformed into:

AIC(B | T) = LL(B | T) − |B|    (9)

Recently fCLL was introduced, which is an approximation of the conditional log-likelihood criterion [7]. Its decomposability over the network structure is defined as below:

f̂CLL(G | D) = (α + β) L̂L(B | D) − βλ Σ_{i=1}^{n} Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} Σ_{c=0}^{1} N_ijkc [ log(N_ijkc / N_ij*c) − log(N_ijc / N_ij*) ] − βλ Σ_{c=0}^{1} N_c log(N_c / N) − β N ρ    (10)

The authors empirically showed that their measure of scoring function is not only decomposable to Bayesian

These include the discrete nature of the dataset with no missing values. The second assumption is that each case of the dataset enjoys an independent probabilistic nature. It is convenient to define I2S formally by means of some definitions and lemmas.

3.2. Definition 1

Given two features A and B, f is a relation on A·B such that f: φ(Y)→X, where domain(Y) = {A, B} and for every a∈A there exists precisely one b∈B with (a, b)∈X. φ is a real-valued function over a domain of finite variables Y, whereas members of domain(Y) need not be linked in conditional probability. Every element of X is comprised of data from each distinct state of B determined by unique outcome states of A.

3.3. Definition 2

Let X be a matrix as defined in definition 1; then the value of I2S is defined as:

Y_ij = X_ij / B_j  ∴  0 ≤ i ≤ m ∧ 0 ≤ j ≤ n    (11)

I2S = Σ_{j=1}^{n} ( [Max_{j=1}^{n}(Y_ij) − Average_{j=1}^{n}(Y_ij)] × [m/(m−1)] × B_j )    (12)
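For later comparison with I2S, the conventional scores reviewed in section 2 (Equations 2 and 5 to 9) all reduce to frequency counting over the same statistics N_ijk and N_ij. The following is a minimal sketch, not the authors' implementation: it assumes a complete, integer-coded dataset, takes q_i as the number of observed parent configurations, and uses base-2 logarithms throughout (Equation 5 leaves the base implicit).

```python
import math
from collections import Counter

def decomposed_scores(data, parents):
    """LL (Eq. 5), network length |B| (Eq. 7), and the BIC (Eq. 2),
    MDL (Eq. 6) and AIC (Eq. 9) scores from frequency counts.
    data: list of integer tuples, one row per case.
    parents: dict mapping a node index to a tuple of parent indices."""
    N, n = len(data), len(data[0])
    ll, size = 0.0, 0
    for i in range(n):
        r_i = len({row[i] for row in data})             # states of X_i
        pa = parents.get(i, ())
        # N_ij: parent-configuration counts; N_ijk: joint counts with X_i
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        ll += sum(c * math.log2(c / n_ij[j]) for (j, _), c in n_ijk.items())
        size += len(n_ij) * (r_i - 1)                   # q_i (r_i - 1)
    bic_mdl = ll - 0.5 * math.log2(N) * size            # Eqs. 2 and 6 coincide here
    aic = ll - size                                     # Eq. 9
    return ll, size, bic_mdl, aic
```

For a four-case dataset over two binary features with the single edge 0→1, `decomposed_scores([(0,0),(0,0),(1,1),(1,1)], {1: (0,)})` gives LL = −4, |B| = 3, and hence BIC/MDL = −7.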
elements such that: P(x_ij) = P(x_ji). In this case, the value of I2S for matrix X will be zero.

Proof: Let A and B be two features in a dataset. A and B comprise m and n distinct state outcomes respectively, such that:

P(A_1) + P(A_2) + ... + P(A_m) = Σ_{i=1}^{m} P(A_i)    (19)

P(B_1) + P(B_2) + ... + P(B_n) = Σ_{j=1}^{n} P(B_j)    (20)

Let feature sets A and B comprise m and n distinct states. Let X→A·B such that each element of X corresponds to a∈A and b∈B such that:

Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij = 1    (21)

Definition 3 gives the orientation of the matrix in a way to deduce the relationship between features A, B in matrix X. It is clear that when definition 2 is applied on the matrix X, it will decrease each value by the same factor in each row and column monotonically. By definition 2, the Max and Average of each row are also the same in this case. By definition, I2S_i = 0 for each row in the matrix. From this explanation we can conclude that:

Σ_{j=1}^{n} I2S_ij = 0    (22)

Hence, it is proved that the matrix is fully integrated.

3.6. Lemma 2

If we distribute the probability of each element in matrix X in a non uniform fashion, then its segregation will increase at the expense of integration.

Proof: Let A and B be two features in a dataset. A and B comprise m and n distinct state outcomes respectively, such that:

P(A_1) + P(A_2) + ... + P(A_m) = Σ_{i=1}^{m} P(A_i)    (23)

P(B_1) + P(B_2) + ... + P(B_n) = Σ_{j=1}^{n} P(B_j)    (24)

Let X→A·B given that each element of X corresponds to a∈A and b∈B such that: Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij = 1.

Now, let us choose a single element P_kl where 0 ≤ k ≤ m and 0 ≤ l ≤ n. We increase its probability at the expense of decreasing the probability of other elements in a random fashion, but with the condition: Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij = 1.

Definition 3 gives the orientation of the matrix in a way to deduce the relationship between features A, B in matrix X. As we apply definition 2 on the matrix X, it will decrease each value by the same factor in each row and column monotonically. By definition 2, the Max and Average of each row are not the same in this case. Hence, by definition, I2S_ij > 0 for at least the row containing the element P_kl. From this explanation we can conclude that: Σ_{j=1}^{n} I2S_ij > 0.

Hence, it is proved that the matrix has lost its integration at the expense of an increase in its segregation.

3.7. Lemma 3

Matrix X is said to be fully segregated, with the value of I2S approaching 1, if each distinct state of feature B is selected by not more than one distinct state of feature A.

Proof: Let A and B be two features in a dataset. A and B comprise m and n distinct outcomes respectively, as shown in equations 23 and 24. Let feature sets A and B comprise m and n distinct states. Let X→A×B such that each element of X corresponds to a∈A ∧ b∈B such that: Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij = 1.

Now, let us choose a single element P_kl where 0 ≤ k ≤ m and 0 ≤ l ≤ n. We increase its probability at the expense of decreasing the probability of other elements in a random fashion, but with the condition: Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij = 1.

Definition 3 gives the orientation of the matrix in a way to deduce the relationship between features A, B in matrix X. We repeat this process for each row, till matrix X is transposed into an X′ such that it contains approximately all of its probability distribution in m elements. As we apply definition 2 on the matrix X, it will decrease each value by the same factor in each row and column monotonically. By definition 2, the Maximum and Average of each row are not the same in this case. By application of definition 2 we will get Σ_{j=1}^{n} I2S_ij ≫ 0 for the complete matrix. From this explanation we can conclude that the matrix is fully segregated. In other words, we can say that if we distribute the probability of each element in matrix X in such a way that one element in each row of matrix X contains the maximum probability out of the total sum of the probability of that row, then such a matrix is fully segregated with a value approaching 1.

3.8. Definition 4

Given a DAG, I2S is sensitive to the order of a sink and its parent node. A swap will change the value of I2S such that I2S(A, B) ≠ I2S(B, A). This characteristic is most important to correctly identify the true order of two nodes in structure learning for decision making.

3.9. Definition 5

The measure I2S is minimum if all the input probability measures become equiprobable. On the other hand, the measure tends to be maximized if there is a maximum difference found among the probability distributions, between the combined probability distributions, such that f: ð(Θ), where ð is a function to calculate the maximum difference between probability distribution parameters Θ in a cross product of both of the features, sink and its corresponding parent, in a DAG.

3.10. Lemma 4

Given two nodes in a BBN, I2S can detect the best topology of the two nodes.

Proof: Let A and B be two features such that A holds a deterministic function of B (sink). Feature A contains m distinct states and B contains n distinct states. If we are to get the inference of A on B such that (B | A), then a conditional probability table C is formulated such as {C_ij / (i ≤ m) and (j ≤ n)}. The objective of the naïve Bayes classifier is to:

Maximize  f_1: C_11, C_12, ..., C_1i
          f_2: C_21, C_22, ..., C_2i    (25)
          f_3: C_i1, C_i2, ..., C_ii

There are two possible orientations for the topology of two nodes: A→B or B→A. We apply the integration to segregation measure on both of these topologies such that: I2S_1: A→B and I2S_2: B→A. As I2S is an asymmetric measure, therefore:

I2S_2 − I2S_1 > 0
I2S_2 − I2S_1 = 0    (26)
I2S_2 − I2S_1 < 0

Every function in Equation 25 will be increased with the increase in the difference between its corresponding variables. The same is true for I2S, where each I2S will be increased as its underlying features become less segregated with an increase in integration. Hence, it is maximization of the probabilistic function defined in

Step 8. I2S = ⊕V
Step 9. End

It is convenient to decompose the whole computational methodology into three functions to elaborate it in a sensible flow. Algorithm 1 is the prime part of the computational methodology for calculation of the value of I2S. It begins with two input parameters F1 and F2, giving the final value of I2S between both of these features. In line 2, the function calls a function GPM which is responsible for generation of the pair-wise matrix out of the two features. This function (GPM) is shown in algorithm 2, where each of the features is first subjected to calculation of the unique values involved in the feature. A matrix M is born given the dimensions of the lengths of the unique values of features 1 and 2 as the number of rows and columns of matrix M respectively. These steps are pointed out from step 2 to step 5, in which the unique values U1, U2 are calculated and then the lengths of the unique values L1, L2 are calculated. For convenience of understanding of the procedure outlined in this function, we have made the assumption that in each feature the unique values are consecutive. This can be achieved by first decoding each feature attribute of the dataset. The rest of the lines in the function (GPM) simply count the number of distinct values of both of the features in a pair-wise fashion by populating the resulting matrix M. Now, we shall again continue our explanation of our prime function Find_I2S.

Algorithm 2: generate pair-wise matrix
Input: F1, F2
Output: Trans
Function [M] = GPM(F1, F2)
Step 1. U1 = ⊲ F1
Step 3. U2 = ⊲
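The three functions just described might be sketched as follows. This is a hedged reconstruction, not the authors' code: the pair-wise counting, the (max − mean) elemental value with the cnt/(cnt − 1) correction, and the final summation follow Definition 2 and the algorithm descriptions as printed, while the weighting of each row by its share of the total counts is our reading of the B_j factor in Equation 12, which the truncated listings do not confirm.

```python
from collections import Counter

def gpm(f1, f2):
    """Sketch of Algorithm 2 (GPM): pair-wise matrix of joint occurrence
    counts, rows indexed by distinct values of f1, columns by those of f2."""
    u1, u2 = sorted(set(f1)), sorted(set(f2))
    m = [[0] * len(u2) for _ in u1]
    for (a, b), c in Counter(zip(f1, f2)).items():
        m[u1.index(a)][u2.index(b)] = c
    return m

def i2s_el(row):
    """Sketch of Algorithm 3 (I2SEl): elemental value of one normalised
    row, corrected by the degree-of-freedom factor cnt/(cnt - 1)."""
    cnt = len(row)
    return (max(row) - sum(row) / cnt) * cnt / (cnt - 1)

def find_i2s(f1, f2):
    """Sketch of Find_I2S: normalise columns of the pair-wise matrix by
    their totals B_j (Eq. 11), take each row's elemental value, weight
    each row by its share of the counts (our assumption), and sum."""
    m = gpm(f1, f2)
    col = [sum(r[j] for r in m) for j in range(len(m[0]))]
    total = sum(col)
    v = [i2s_el([r[j] / col[j] for j in range(len(r))]) * sum(r) / total
         for r in m]
    return sum(v)
```

Under this reading the lemmas come out as stated: a uniform pairing such as `find_i2s([0,0,1,1], [0,1,0,1])` yields 0 (fully integrated, Lemma 1), while a one-to-one pairing such as `find_i2s([0,0,1,1], [0,0,1,1])` yields 1 (fully segregated, Lemma 3).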
each row of the matrix M is updated by dividing by the corresponding element of vector B, such that the row number of both numerator and denominator is the same. Another function, I2SEl, is called in line 8. This function is responsible for defining each of the individual elements of the final outcome vector of I2S. The underlying logic to compute the value of each element of vector V is determined by finding the maximum and mean of each row of the pair-wise matrix M. The difference of maximum and mean is multiplied by the number of elements in the row. The resulting scalar value is divided by the number of elements in the row with degree of freedom 1.

Algorithm 3: calculate I2S elemental value
Input: Inp
Output: I2S
Function [ret] = I2SEl(Inp)
Step 1. mx = max(Inp)
Step 2. mn = mean(Inp)
Step 3. cnt = |Inp|
Step 4. ret = (mx − mn) × cnt / (cnt − 1)
Step 5. End

At the end of the loop, which is marked by the threshold value sz, we achieve a final vector V with the same number of elements as the number of rows in matrix M. In line 15, a complete sum of all of the elements in the vector is obtained, which is our final desirable outcome value known as I2S.

In this section, we discussed the mathematical theory of the proposed measure. First we showed that, theoretically, the asymmetric property of a dependence measure is more important in the judgment of sink and source nodes. In the second part of this section, we highlighted the algorithmic steps involved in the calculation of I2S. In the next section we present empirical results to validate the claims made in the current and previous sections.

4. Results and Discussion

We carried out experimentation on fourteen benchmark datasets obtained from UCI [5], as shown by Figures 1 and 2. All of these datasets contain multinomial/categorical features; hence, they were fit for our experimentation. First we explain the detail of the experiment as shown by Tables 1 and 2. Both of these tables indicate details of a few experiments performed on one benchmark dataset, Zoo [5]. The first two columns of Table 1 indicate two features, one of which is designated as sink while the other is marked as class node. The last four columns of Table 1 show the class imbalance characteristics (accuracy measures) of the experiment, including Recall or True Positive Rate (TPR), False Positive Rate (FPR), Precision and F-Measure. In the literature, it was stated that maximization of the scoring function guarantees the correct structure of learning. Thus, we have provided this information under the assumption that there must be a relationship of scoring function versus these class imbalance characteristics, to judge how accurately a structure is defined while describing the positions of both of the nodes. Table 2 indicates a gradual increase and decrease in the value of I2S versus other accuracy measures.

Table 1. Class imbalance characteristics (Zoo dataset).

S.   Sink  Class  Recall  FPR    Prec   F-M
1    1     2      0.802   0.802  0.643  0.714
     2     1      0.624   0.279  0.8    0.59
2    3     7      0.554   0.554  0.307  0.396
     7     3      0.584   0.584  0.341  0.431
3    11    17     0.406   0.36   0.179  0.248
     17    11     0.921   0.921  0.848  0.883
4    4     5      0.762   0.762  0.581  0.66
     5     4      0.545   0.458  0.559  0.549
5    6     9      0.822   0.822  0.675  0.741
     9     6      0.644   0.644  0.414  0.504
6    10    16     0.535   0.572  0.458  0.436
     16    10     0.792   0.792  0.627  0.7
7    8     12     0.832   0.832  0.692  0.755
     12    8      0.604   0.604  0.365  0.455
8    13    14     0.832   0.46   0.845  0.805
     14    13     0.406   0.28   0.194  0.261
9    15    16     0.525   0.595  0.308  0.388
     16    15     0.871   0.871  0.759  0.811
10   4     13     0.465   0.206  0.417  0.439
     13    4      0.832   0.192  0.831  0.831

There were seventeen features available in the Zoo dataset, for which 136 (17×16→272/2) pair-wise structures can be built. We pick only ten pair results in Tables 1 and 2, chosen randomly in such a way that almost every feature can be shown in Table 1. We justify the adoption of pair-wise sets of features in terms of the introduction of a non-augmented classifier, in which any feature can be considered as a class while the other features are treated for selection of its corresponding potential markov blanket. Take an example from Table 1, in which serial 1 indicates two pairs with feature number one and feature number two. Firstly, feature two was considered parent while the other, feature number one, was assumed to be its sink or single markov blanket node. In the second experiment, the positions of both nodes were swapped. We analyzed accuracy measures against scoring functions as shown by Tables 1 and 2. This part of the experiment, for reasoning of uncertainty, was performed in WEKA [14]. General parameters for classification in the WEKA experimentation were the simple estimator and 10 fold cross validation, while the K2 score type chosen was Bayes. A careful examination of Tables 1 and 2 indicates that a higher value of I2S ensures higher precision, recall (TPR) and lower FPR; for example, when feature 1 is selected as parent and feature 2 as sink node, the Recall (TPR), FPR, Precision and F-Measure values are 0.802, 0.802, 0.643 and 0.714 respectively. A swap of both of the features yields the values of 0.624, 0.279, 0.8 and 0.59 for Recall (TPR), FPR, Precision and F-Measure. Here, the question arises how we can relate this to the scoring values presented in Table 2.

Figures 1 and 2 show a detailed summary of the results. Figure 1 indicates the number of correct topologies revealed by the various scoring functions, while Figure 2 points out the number of incorrect topologies drawn by well known and established scoring functions along with our proposed measure.
so that it can also print out the final fCLL score at the exit of each cycle. However, we have restricted ourselves to the provision of its general result, because the detail of its scores leads to the conclusion that there was no particular pattern found in the scoring function of fCLL in comparison to the accuracy measures. In many cases, an increase in the fCLL value delivers a low value of TPR. Hence, there was no argument for describing its score. The potential reason behind this 'anomalous fashion' is that the said scoring function evolved in the domain of TAN, C4.5, K-NN, SVM and LogR classifiers, but our experimentation uses K2 as the searching algorithm.

At this point, it is useful to reiterate Lemma 4, which we mentioned previously. We stated that "Given two nodes in a BBN, I2S can detect the best topology of the two nodes". We can observe from the experimentation that all conventional scoring functions exhibit minor or major deviations from Lemma 4. The scoring function fCLL is an exceptional case, which in most of the cases gives no significant result. Hence, we can conclude that empirically a better scoring function is one which is capable of exhibiting the true topology of the given features. These experiments explain that I2S possesses a relevant characteristic in the development of topology in structure learning of a Bayesian belief network. However, the other mutual information measures cannot indicate the correct topology of the dataset.

5. Conclusions

five of them are implemented in WEKA, and the sixth measure (fCLL) has been provided with online source code written as a WEKA extension for delivering a reliable result in the case of exercising the belief network classifier. The empirical validation was carried out, indicating that there is a difference in class imbalance measures for the results achieved on various datasets. The results clearly indicate that a higher value of I2S means higher accuracy in terms of various accuracy measures like precision, recall, F-measure and false positive rate. However, this conclusion is not uniform, because the difference in I2S for both possible configurations is not proportionate to the corresponding difference in class imbalance metrics. There are two underlying possibilities: either a change in dataset, or the inability of the proposed measure to explain it vigorously and in more depth. This leaves us with the option of improving the proposed measure so that a proportionate difference in class imbalance metrics and I2S can be observed. We have left this for future work. Another future direction for this work is to introduce a new breed of classifiers with quite reasonable performance exhibiting improved accuracy. We also plan to come up with a commercial classifier based on our introduced measure. A third future direction for this work lies in tailoring the measure to make it adaptable to a wide variety of datasets. Currently the measure is ideally suitable for multinomial (categorical) features only, whereas there is a need to encompass features other than multinomial datasets.
[7] Carvalho A., Roos T., Oliveira A., and Myllymäki P., "Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2181 - 2210, 2011.
[8] Cheng J., Bell D., and Liu W., "An Algorithm for Bayesian Belief Network Construction from Data," in Proceedings of AI & STAT, Lauderdale, Florida, pp. 83 - 90, 1997.
[9] Cooper G. and Herskovits E., "A Bayesian Method for the Induction of Probabilistic Networks from Data," Machine Learning, vol. 9, no. 4, pp. 309 - 347, 1992.
[10] Corder G. and Foreman D., Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, John Wiley & Sons, New York, USA, 2009.
[11] Duwairi R., "Arabic Text Categorization," the International Arab Journal of Information Technology, vol. 4, no. 2, pp. 125 - 131, 2007.
[12] Gibbons J. and Subhabrata C., Nonparametric Statistical Inference, Marcel Dekker, New York, USA, 2003.
[13] Grimmett G. and Stirzaker D., Probability and Random Processes, Oxford University Press, USA, 2001.
[14] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., and Witten I., "The WEKA Data Mining Software: An Update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10 - 18, 2009.
[15] Heckerman D., Geiger D., and Chickering D., "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, vol. 20, no. 3, pp. 197 - 243, 1994.
[16] Herskovits E., "Computer-Based Probabilistic Network Construction," Doctoral Dissertation, Stanford University, Stanford, CA, 1991.
[17] Jensen F. and Nielsen T., Bayesian Networks and Decision Graphs, Springer, New York, 2007.
[18] Lam W. and Bacchus F., "Learning Bayesian Belief Networks: An Approach Based on the MDL Principle," Computational Intelligence, vol. 10, no. 3, pp. 269 - 294, 1994.
[19] Meilă M. and Jaakkola T., "Tractable Bayesian Learning of Tree Belief Networks," Statistics and Computing, vol. 16, no. 1, pp. 77 - 92, 2006.
[20] Messikh L. and Bedda M., "Binary Phoneme Classification Using Fixed and Adaptive Segment-Based Neural Network Approach," the International Arab Journal of Information Technology, vol. 8, no. 1, pp. 48 - 51, 2011.
[21] Schwarz G., "Estimating the Dimension of a Model," Annals of Statistics, vol. 6, no. 2, pp. 461 - 464, 1978.
[22] Srinivas S., Russell S., and Agogino A., "Automated Construction of Sparse Bayesian Networks from Unstructured Probabilistic Models and Domain Information," in Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, Ontario, Canada, pp. 343 - 350, 1989.
[23] Suzuki J., "Learning Bayesian Belief Networks Based on the MDL Principle: An Efficient Algorithm Using the Branch and Bound Technique," in Proceedings of the International Conference on Machine Learning, Bari, Italy, pp. 356 - 367, 1996.
[24] Tlemsani R. and Benyettou A., "On Line Isolated Characters Recognition Using Dynamic Bayesian Networks," the International Arab Journal of Information Technology, vol. 8, no. 4, pp. 406 - 413, 2011.
[25] Velarde-Alvarado P., Vargas-Rosales C., Torres-Roman D., and Martinez-Herrera A., "An Architecture for Intrusion Detection Based on an Extension of the Method of Remaining Elements," Journal of Applied Research and Technology, vol. 8, no. 2, pp. 159 - 176, 2010.
[26] Wasserman L., All of Nonparametric Statistics, Springer, New York, USA, 2007.
[27] Wermuth N. and Lauritzen S., "Graphical and Recursive Models for Contingency Tables," Biometrika, vol. 70, no. 3, pp. 537 - 552, 1983.

Muhammad Naeem is a research scholar at the Department of Computer Science, M. A. Jinnah University, Islamabad, Pakistan. His research areas include machine learning, text retrieval, graph mining, classification and data mining.

Sohail Asghar is Director/Associate Professor at the University Institute of Information Technology, PMAS-Arid Agriculture University, Rawalpindi, Pakistan. He earned his PhD in Computer Science from Monash University, Melbourne, Australia in 2006. Earlier, he completed his Bachelor of Computer Science (Hons) at the University of Wales, United Kingdom in 1994. His research interests include data mining, decision support systems and machine learning.