Answers:
A probabilistic graphical model (PGM) is a graph formalism for compactly modeling joint probability distributions and (in)dependence relations over a set of random variables. A PGM is called a Bayesian network when the underlying graph is directed, and a Markov network/Markov random field when the underlying graph is undirected. Generally speaking, you use the former to model probabilistic influence between variables that have clear directionality, otherwise you use the latter; in both versions of PGMs, the lack of edges in the associated graphs represents conditional independencies in the encoded distributions, although their exact semantics differ. The "Markov" in "Markov network" refers to a generic notion of conditional independence encoded by PGMs, that of a random variable being independent of all others given some set of "important" variables (the technical name is a Markov blanket), i.e. X ⊥ (all other variables) | MB(X).
A Markov process is any stochastic process that satisfies the Markov property. Here the emphasis is on a collection of (scalar) random variables X_1, X_2, X_3, …, typically thought of as being indexed by time, that satisfies a specific kind of conditional independence: roughly speaking, "the future is independent of the past given the present", i.e. P(X_{t+1} | X_1, …, X_t) = P(X_{t+1} | X_t). This is a special case of the "Markov" notion defined by PGMs: simply take the set of variables to be {X_1, …, X_{t+1}}, take the Markov blanket of X_{t+1} to be {X_t}, and invoke the previous statement. From this we see that the Markov blanket of any variable X_{t+1} is its predecessor X_t.
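The Markov-blanket claim above can be checked numerically. Below is a minimal sketch with a hypothetical 3-step binary chain X1 → X2 → X3 (the CPT values are made up for illustration): building the joint from the chain factorization and verifying that P(X3 | X1, X2) = P(X3 | X2) for every assignment.

```python
import numpy as np

# Assumed toy CPTs for a binary chain X1 -> X2 -> X3.
p1 = np.array([0.6, 0.4])                  # P(X1)
T = np.array([[0.7, 0.3], [0.2, 0.8]])     # T[i, j] = P(X_{t+1}=j | X_t=i)

# Joint P(X1, X2, X3) from the chain factorization P(X1) P(X2|X1) P(X3|X2).
joint = p1[:, None, None] * T[:, :, None] * T[None, :, :]

# Check: conditioning X3 on both X1 and X2 gives the same answer as
# conditioning on X2 alone -- X2 is the Markov blanket of X3's past.
for x1 in range(2):
    for x2 in range(2):
        cond_full = joint[x1, x2] / joint[x1, x2].sum()   # P(X3 | X1, X2)
        cond_markov = T[x2]                               # P(X3 | X2)
        assert np.allclose(cond_full, cond_markov)
```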
Therefore you can represent a Markov process with a Bayesian network: a linear chain indexed by time, x_1 → x_2 → ⋯ → x_T (for simplicity we only consider the case of discrete time/state here; Bishop's PRML book has the standard chain diagram). This kind of Bayesian network is known as a dynamic Bayesian network. Since it is a Bayesian network (hence a PGM), one can apply standard PGM algorithms for probabilistic inference (like the sum-product algorithm, of which the Chapman-Kolmogorov equations represent a special case) and parameter estimation (e.g. maximum likelihood, which boils down to simple counting) over the chain. Example applications of this are the HMM and the n-gram language model.
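The Chapman-Kolmogorov connection is easy to see concretely: summing out the intermediate state of a two-step transition is exactly a matrix product. A minimal sketch, using a made-up 2-state transition matrix:

```python
import numpy as np

# Hypothetical 2-state chain: states 0 = "sunny", 1 = "rainy".
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])    # P[i, j] = P(X_{t+1}=j | X_t=i); rows sum to 1

# Chapman-Kolmogorov: the two-step kernel sums out the intermediate state,
# which is precisely matrix multiplication (a special case of sum-product).
P2_sum = np.array([[sum(P[i, k] * P[k, j] for k in range(2))
                    for j in range(2)] for i in range(2)])
P2_mat = P @ P
assert np.allclose(P2_sum, P2_mat)

# Marginal at time t: propagate an initial distribution through the chain.
pi0 = np.array([1.0, 0.0])                    # start in state "sunny"
pi3 = pi0 @ np.linalg.matrix_power(P, 3)      # distribution after 3 steps
```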
Often you see a Markov chain depicted as a state-transition diagram: nodes labeled with the possible values of a Mood variable (e.g. "Happy", "Sad"), with weighted arrows between them.
This is not a PGM, because the nodes are not random variables but elements of the state space of the chain; the edges correspond to the (non-zero) transition probabilities between two consecutive states. You can also think of this graph as describing the CPT (conditional probability table) of the chain PGM. This Markov chain only encodes the state of the world at each time step as a single random variable (Mood); what if we want to capture other interacting aspects of the world (like the Health and Income of some person), and treat the state at each time as a vector of random variables, e.g. X_t = (Mood_t, Health_t, Income_t)? This is where PGMs (in particular, dynamic Bayesian networks) can help. We can model complex transition distributions P(X_t | X_{t-1}) using a conditional Bayesian network typically called a 2TBN (2-time-slice Bayesian network), which can be thought of as a fancier version of the simple chain Bayesian network.
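To make the 2TBN idea concrete, here is a minimal sketch (all CPT numbers and the Mood/Health structure are made up): instead of one joint transition table over the full state vector, each variable gets its own small CPT conditioned on the previous time slice.

```python
import random

# Hypothetical factored transition over a binary state vector (mood, health).
# Assumed 2TBN structure: mood_t depends on (mood_{t-1}, health_{t-1}),
# while health_t depends only on health_{t-1}.
P_mood = {  # P(mood_t = 1 | mood_{t-1}, health_{t-1})
    (0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9,
}
P_health = {0: 0.7, 1: 0.9}  # P(health_t = 1 | health_{t-1})

def step(state, rng):
    """Sample the next time slice from the factored transition model."""
    mood, health = state
    new_mood = int(rng.random() < P_mood[(mood, health)])
    new_health = int(rng.random() < P_health[health])
    return (new_mood, new_health)

rng = random.Random(0)
state = (1, 1)
trajectory = [state]
for _ in range(5):
    state = step(state, rng)
    trajectory.append(state)
```

Note the parameter saving: the factored model needs 4 + 2 numbers here, versus a full 4×4 joint transition table; this gap widens quickly as more variables are added.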
TL;DR: a Bayesian network is a kind of PGM (probabilistic graphical model) that uses a directed (acyclic) graph to represent a factorized probability distribution and the associated conditional independencies over a set of variables. A Markov process is a stochastic process (typically thought of as a collection of random variables) with the property of "the future being independent of the past given the present"; the emphasis is more on studying the evolution of the single "template" random variable across time (often as t → ∞). A (scalar) Markov process satisfies the specific conditional independence X_{t+1} ⊥ X_{1:t-1} | X_t and therefore can be trivially represented by a chain Bayesian network, whereas dynamic Bayesian networks can exploit the full representational power of PGMs to model interactions among multiple random variables (i.e., random vectors) across time; a great reference on this is chapter 6 of Daphne Koller's PGM book.
First, a few words about Markov Processes. There are four distinct flavours of that beast, depending on whether the state space is discrete or continuous and whether the time variable is discrete or continuous. The general idea of any Markov Process is that "given the present, the future is independent of the past".
The simplest Markov Process is the discrete-time Markov Chain on a discrete, finite state space. You can visualize it as a set of nodes with directed edges between them. The graph may have cycles, and even self-loops. On each edge you write a number between 0 and 1, in such a manner that, for each node, the numbers on the edges outgoing from that node sum to 1.
Now imagine the following process: you start in a given state A. Every second, you choose at random an outgoing edge from the state you're currently in, with the probability of choosing an edge equal to the number written on it. In this way, you generate a random sequence of states.
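The process just described can be sketched in a few lines; the three-state graph below (with a cycle and a self-loop) is an arbitrary example:

```python
import random

# Assumed toy chain: outgoing edge probabilities from each node sum to 1.
# Note the cycle A <-> B and the self-loops on A and C.
edges = {
    "A": [("A", 0.5), ("B", 0.5)],
    "B": [("A", 0.3), ("C", 0.7)],
    "C": [("C", 1.0)],          # absorbing state via a self-loop
}

def walk(start, steps, rng):
    """Generate a random sequence of states by repeatedly picking an
    outgoing edge with probability equal to the number written on it."""
    state, path = start, [start]
    for _ in range(steps):
        targets, probs = zip(*edges[state])
        state = rng.choices(targets, weights=probs)[0]
        path.append(state)
    return path

path = walk("A", 10, random.Random(42))
```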
A very cool visualization of such a process can be found here: http://setosa.io/blog/2014/07/26/markov-chains/
The takeaway message is that a graphical representation of a discrete-space, discrete-time Markov Process is a general graph that represents a distribution over sequences of nodes of the graph (given a starting node, or a starting distribution over nodes).
On the other hand, a Bayesian Network is a DAG (Directed Acyclic Graph) which represents a factorization of some joint probability distribution. Usually this representation exploits conditional independence between some variables, to simplify the graph and decrease the number of parameters needed to estimate the joint probability distribution.
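The parameter saving is easy to illustrate. A minimal sketch with a made-up network A → C ← B over binary variables: the factorization P(A)P(B)P(C|A,B) needs 1 + 1 + 4 = 6 numbers instead of the 2³ − 1 = 7 of an unrestricted joint table, and the independence of A and B falls out of the structure.

```python
import numpy as np

# Assumed toy CPTs for the DAG A -> C <- B (all variables binary).
pA = np.array([0.3, 0.7])                     # P(A)
pB = np.array([0.6, 0.4])                     # P(B)
pC = np.array([[[0.9, 0.1], [0.4, 0.6]],
               [[0.5, 0.5], [0.1, 0.9]]])     # pC[a, b] = P(C | A=a, B=b)

# Joint from the factorization P(A) P(B) P(C | A, B).
joint = pA[:, None, None] * pB[None, :, None] * pC
assert np.isclose(joint.sum(), 1.0)

# Marginal independence of A and B is implied by the graph structure:
pAB = joint.sum(axis=2)                       # marginalize out C
assert np.allclose(pAB, np.outer(pA, pB))
```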
While I was searching for an answer to the same question, I came across these answers, but none of them clarified the topic for me. When I found a good explanation, I wanted to share it with people who had the same confusion.
In the book "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference" by Judea Pearl, chapter 3 ("Markov and Bayesian Networks: Two Graphical Representations of Probabilistic Knowledge"), p. 116:
The main weakness of Markov networks is their inability to represent induced and non-transitive dependencies; two independent variables will be directly connected by an edge, merely because some other variable depends on both. As a result, many useful independencies go unrepresented in the network. To overcome this deficiency, Bayesian networks use the richer language of directed graphs, where the directions of the arrows permit us to distinguish genuine dependencies from spurious dependencies induced by hypothetical observations.
A Markov process is a stochastic process with the Markovian property (when the index is time, the Markovian property is a special conditional independence, which says that, given the present, the past and the future are independent).
A Bayesian network is a directed graphical model. (A Markov random field is an undirected graphical model.) A graphical model captures conditional independence, which can be different from the Markovian property.
I am not familiar with graphical models, but I think a graphical model can be seen as a stochastic process.
-The general idea of any Markov Process is that "given the present, future is independent of the past".
-The general idea of any Bayesian method is that "given the current belief (the posterior so far), the future is independent of the past": its parameters, if indexed by the observations they have absorbed, will follow a Markov process.
PLUS
"all orderings of the same observations will be the same in how I update my beliefs" (exchangeability).
So its parameters will really be a Markov process indexed by time, and not by observations.
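A minimal sketch of the claim above with a Beta-Bernoulli model (an assumed example, not from the original answer): the posterior is Beta(a, b), and each observation updates (a, b) using only the current (a, b), so the sequence of posterior parameters is itself a Markov process.

```python
# Conjugate Beta-Bernoulli updating: the next posterior parameters depend
# only on the current parameters and the new observation, never on the
# raw history of past observations.
def update(params, observation):
    a, b = params
    return (a + 1, b) if observation == 1 else (a, b + 1)

params = (1, 1)                     # uniform prior Beta(1, 1)
for x in [1, 1, 0, 1, 0]:
    params = update(params, x)
print(params)  # (4, 3): 3 successes and 2 failures on top of the prior
```

Exchangeability shows up here too: any reordering of the observation sequence lands on the same final (a, b).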