Brain networking analysis in migraine with and without aura

Background: To apply effective connectivity analysis by means of nonlinear Granger Causality (GC) and brain networking analysis to basal EEG and to EEG under visual stimulation by checkerboard gratings with spatial frequencies of 0.5 and 2.0 cpd in migraine with aura (MA) and without aura (MO), and to compare these findings with Blood Oxygen Level Dependent (BOLD) signal changes.
Methods: Nineteen asymptomatic MA and MO patients and 11 age- and sex-matched controls (C) were recorded with 65 EEG channels. The same visual stimulation was employed to evaluate BOLD signal changes in a subgroup of MA and MO patients. GC and brain networking analysis were applied to the EEG signals.
Results: In the resting state, a different pattern emerged: reduced GC in MO patients versus increased GC in MA patients. During visual stimulation, both MA and MO patients showed increased information transfer toward the fronto-central regions, while MA patients showed a segregated cluster of connections in the posterior regions and an increased BOLD signal in the visual cortex, more evident at the 2.0 cpd spatial frequency.
Conclusions: The wealth of information exchange in the parietal-occipital regions indicates a peculiar excitability of the visual cortex, a pivotal condition for the manifestation of typical aura symptoms.
Electronic supplementary material: The online version of this article (10.1186/s10194-017-0803-5) contains supplementary material, which is available to authorized users.


Introduction
Computational neuroscience is the study of brain function in terms of the information processing properties of the structures that make up the nervous system. It is an interdisciplinary science that links the diverse fields of neuroscience, cognitive science, and psychology with electrical engineering, computer science, mathematics and physics.
Computational neuroscience is distinct from psychological connectionism and from the learning theories of disciplines such as machine learning, neural networks, and computational learning theory in that it emphasizes descriptions of functionally and biologically realistic systems, together with their physiology and dynamics.
These models capture the essential features of the biological system at multiple spatio-temporal scales, from membrane currents, proteins, and chemical coupling to network oscillations, columnar and topographic architecture, and learning and memory. Computational models are used to frame hypotheses that can be directly tested by biological and/or psychological experiments.
The term "computational neuroscience" was first introduced in 1985 by Eric Schwartz, who organized a conference in Carmel, California, at the request of the Systems Development Foundation to provide a summary of the current status of a field which until that point was referred to by a variety of names, such as neural modeling, brain theory and neural networks. The proceedings of this definitional meeting were published in 1990 as the book "Computational Neuroscience".
The first open international meeting focused on computational neuroscience was organized by James M. Bower and John Miller in San Francisco, California in 1989 and has continued each year since as the annual CNS meeting [3]. The first graduate educational program in computational neuroscience was organized as the Computational and Neural Systems Ph.D. program at the California Institute of Technology in 1985.
Research in computational neuroscience can be roughly categorized into several lines of inquiry, from memory and synaptic plasticity to network behavior, from learning behavior to sensory processing. But what is particularly important is the fact that most computational neuroscientists collaborate closely with experimentalists and neurologists in analyzing novel data and synthesizing new models of biological phenomena.
The present line of research is embedded in a particular branch of computational neuroscience called "neuroinformatics", a research field concerning the organization of neuroscience data through the application of computational models and analytical tools. These areas of research are important for the integration and analysis of increasingly large-volume, high-dimensional, and fine-grained experimental data. Neuroinformaticians provide computational tools and mathematical models, and create interoperable databases for clinicians and research scientists. Neuroscience, in fact, is a heterogeneous field consisting of many and various sub-disciplines (e.g., cognitive psychology, behavioral neuroscience, behavioral genetics, and so on), and for our understanding of the brain to continue to deepen, it is necessary that these sub-disciplines be able to share data and findings in a meaningful way. Neuroinformaticians facilitate this, combining informatics research and brain research and providing benefits for both fields of science. On one hand, informatics facilitates brain data processing and data handling by providing new electronic and software technologies for arranging databases, modeling, and communication in brain research. On the other hand, discoveries in the field of neuroscience will in turn drive the development of new methods in information technology.
In this context, the improved efficiency of processors and, consequently, of computation speed has driven computational neuroscience, and neuroinformatics in particular, toward new possibilities and new tools, such as the Granger Causality and Transfer Entropy indicators. Though born in different contexts (econometrics, weather analysis, and so on), these have become important landmarks of this research branch and are of fundamental utility in extracting neural dynamics from biological data, as with every other kind of temporal data analysis.
Moreover, brain networking analysis has become, in the last three years, a fundamental tool to investigate the dynamics of biological systems, but the present work is, as a matter of fact, probably the first attempt to apply this kind of measure to human brain dynamics in such a massive way, such measures having so far been applied only to the cortical activity of macaques or other species of primates.
If the results of the present analysis are confirmed in the larger context of neurology, the tools developed so far will receive a further incentive to be considered not only fundamental physical and mathematical tools, but also important investigation instruments to help physicians and neurologists investigate and discover real features of brain dynamics.

Chapter 1
Introduction to Information Theory

Basics of Information Theory
To introduce the most fundamental concepts of information theory (IT), it is first necessary to define the concepts of simple and conditional probability.
Let, therefore, p(x, y) be the probability for two events, x and y, to occur simultaneously. If the two events are independent (or "unrelated", such as the simultaneous throw of two dice), then:
p(x, y) = p(x) · p(y)   (1.1)
that is, the probability is equal to the product of the probabilities of the individual events. This, however, is the simplest case and it is not of interest to us. In fact, if we had to calculate the probability of extracting specific cards from a deck in sequence, without the first card being reintroduced into it, (1.1) would not be usable. In this case we prefer to define the "conditional probability", i.e. the probability that an event y occurs given that the event x has already occurred:
p(y|x) = p(x, y) / p(x)   (1.2)
From (1.2) we can obtain:
p(x, y) = p(y|x) · p(x)   (1.3)
which defines the "composed probability", that is, the generalization of (1.1) to the case of non-independent events.
A simple property that we can state, and that will be useful later, is the following:
p(x) · p(y|x) = p(y) · p(x|y)   (1.4)
If we introduce the inequality signs (≤ and ≥), we can also define the concept of "correlation" between two events. We will say, in fact, that
p(x|y) ≥ p(x)   ( or p(x|y) ≤ p(x) )   (1.5)
if the two events are correlated positively (negatively), i.e. if the occurrence of y makes the occurrence of x more (less) likely. Conversely, if p(x|y) ≤ p(x), the occurrence of y makes the occurrence of x less likely.
It is important to note, for completeness, that the fact that two events are correlated or not does not imply that one of them causes the other.
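As a concrete check of relations (1.1)-(1.4), the card-deck example mentioned above can be worked out numerically. The following is an illustrative sketch (the event names are our own choices, not part of the theory):

```python
from fractions import Fraction

# Illustrative example: drawing two cards in sequence from a 52-card deck
# without reintroducing the first card.
# Event x = "first card is an ace", event y = "second card is an ace".
p_x = Fraction(4, 52)            # p(x)
p_y = Fraction(4, 52)            # p(y), by symmetry of the draw
p_y_given_x = Fraction(3, 51)    # p(y|x): one ace has already been removed

# Composed probability (1.3): p(x, y) = p(y|x) * p(x).
p_xy = p_y_given_x * p_x

# The events are not independent, so (1.1) fails:
assert p_xy != p_x * p_y

# Property (1.4): p(x) * p(y|x) = p(y) * p(x|y).
p_x_given_y = p_xy / p_y
assert p_x * p_y_given_x == p_y * p_x_given_y

# Negative correlation in the sense of (1.5): p(x|y) < p(x).
assert p_x_given_y < p_x
```

Exact rational arithmetic makes the identities hold without rounding concerns; here p(x, y) comes out to 1/221.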

Application to medicine: nomenclature
The probabilistic concepts expressed so far can be reconsidered and revised in the medical field for future convenience. In this case, the composed probabilities seen previously assume different names.
If we denote by M the patient suffering from a given disease, by M̄ the healthy patient, and by T± the positive or negative outcome of the test for the diagnosis of the disease of which M is a carrier, then:
• P(M) is called the incidence rate of the disease;
• P(T−|M̄) is the specificity of the test;
• P(T+|M) is the sensitivity of the test;
• P(M|T+) and P(M̄|T−) are the positive and negative predictive values of the test (1.10),
which are, as is evident from the definitions, not numerically complementary.
If, for example, a certain pathology has an incidence rate in the population of P(M) = 0.003 (0.3%), and the test employed has sensitivity and specificity of 0.999 and 0.998 respectively, the predictive values of the test will be
P(M|T+) ≈ 0.6 → 60%
P(M̄|T−) ≈ 0.999 → 99.9%
Ultimately, a positive outcome of the test corresponds to a reliability of 60%, while a negative outcome has a reliability of at least 99.9%.
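The predictive values quoted above follow directly from Bayes' theorem. A minimal sketch, using the example's assumed incidence, sensitivity, and specificity (variable names are our own):

```python
# Assumed values from the example above: incidence P(M) = 0.003,
# sensitivity P(T+|M) = 0.999, specificity P(T-|M-bar) = 0.998.
p_m = 0.003
sens = 0.999
spec = 0.998

# Bayes' theorem for the positive and negative predictive values.
p_pos = sens * p_m + (1.0 - spec) * (1.0 - p_m)                       # P(T+)
ppv = sens * p_m / p_pos                                              # P(M|T+)
npv = spec * (1.0 - p_m) / (spec * (1.0 - p_m) + (1.0 - sens) * p_m)  # P(M-bar|T-)
print(ppv, npv)
```

Despite the near-perfect test, the positive predictive value collapses to about 60% because the disease is rare: most positives come from the large healthy population.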
That being so, we can move on to define the basis of the physical concept of information.

Information and Entropy
Let x be a generic event, with which the probability p(x) of its occurrence is associated. We define the information I(x) corresponding to x as the amount
I(x) = −log p(x)   (1.11)
The reasons why we have chosen this particular structure for (1.11) are obvious: the combination of the minus sign and the logarithm indicates that the less likely the occurrence of the event x, the more information it brings with it; the contrary happens for very probable events, which bring little information with them. Furthermore, the fact that a probability cannot numerically exceed unity translates into the fact that the information cannot be negative in any case, in agreement with the second law of thermodynamics; in addition, it is defined in such a way that a sure event, p(x) = 1, carries no information. Finally, I(x) is a continuous, strictly decreasing monotonic function (no "probability jumps").
Closely related to the concept of information, there is a particular form of entropy, called information entropy, which measures the average information content of a given source. Its definition is not unique and depends on the scope and usage of the particular conceptual area. Some definitions, in fact, conceive the information entropy as a measure of the amount of uncertainty present in a random signal, others as a measure of the information contained therein, or again as the minimum descriptive complexity of a random variable, i.e. the lower limit of lossless data compression: Kullback entropy and conditional mutual entropy (which we will not deal with in this work) are an obvious example.
Also as far as its operational definition is concerned, there is a wide choice of possibilities. However, in this line of research we will refer to the concept of information entropy according to Shannon [1].
Let there be given, therefore, an ordered set X = {x₁, x₂, ..., xₙ} of events. According to Shannon's definition, the information entropy for this set will be:
H(X) = −Σ_x p(x) · log[p(x)]   (1.12)
where p(x) is the usual probability of x and the sum extends over all elements of the X sequence.
The fact that the probability of a particular event may also be zero (which would make the logarithm diverge) is not a critical point of the theory, as the divergence can easily be avoided by recalling the basic limit
lim_{x→0} x · log(x) = 0   (1.13)
Furthermore, for a binary source the entropy function shows a maximum at the probability value p = 1/2 (Figure 1.1).
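Shannon's entropy (1.12), together with the convention implied by (1.13), is straightforward to compute. A minimal sketch (the function name is our own), verifying that a sure event carries no information and that a binary source is maximally entropic at p = 1/2:

```python
import math

def shannon_entropy(probs, base=2.0):
    """H(X) = -sum_x p(x) * log p(x), with the convention 0*log(0) = 0 (1.13)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

# A sure event carries no information ...
assert shannon_entropy([1.0]) == 0.0
# ... and a binary source is maximally entropic at p = 1/2 (Figure 1.1): 1 bit.
assert shannon_entropy([0.5, 0.5]) == 1.0
assert shannon_entropy([0.9, 0.1]) < shannon_entropy([0.5, 0.5])
```

Skipping zero-probability terms in the sum is exactly the limit (1.13): those terms contribute nothing.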
As is evident, we have not indicated the base of the logarithm, which also matters for the numerical value of the entropy: bases 2, e, and 10 correspond to units of bits, nats, and hartleys, respectively. If we want to generalize and, instead of a single set of data, two sets of events are available, X = {x₁, x₂, ..., xₙ} and Y = {y₁, y₂, ..., yₙ}, the Shannon information entropy will be calculated differently depending on whether the two sets X and Y are "related" (or, more precisely, not independent) or not. In the first case we have
H_D(X, Y) = −Σ_{x,y} p(x, y) · log[p(x, y)]   (1.14)
in the second, however, (1.14) turns into:
H_I(X, Y) = −Σ_{x,y} p(x, y) · log[p(x) · p(y)]   (1.15)
From (1.14) and (1.15) we can define a certain number of combinations that define, in different ways and case by case, the concept of "mutual information" m_i, a measure of the amount of information shared by two systems.
A working definition, useful for assessing m_i, identifies it as the difference between the entropies H_I and H_D of the two systems, that is, as the difference between the case in which the two systems are independent and the case in which they are not:
m_i = Σ_{x,y} p(x, y) · log[ p(x, y) / (p(x) · p(y)) ]   (1.16)
The mutual information, however, is not entirely suitable for our purposes, because it provides nothing predictive: it neither anticipates the future dynamics of the systems X and Y, nor provides the probability that the (n+1)-th event x_{n+1} will occur on the basis of the sequences X and Y.
The reason for this lies in the very structure of (1.16), which is invariant under the exchange X ↔ Y. In addition, (1.16) gives no information about the directionality of the information flow, i.e. whether it moves from X to Y or vice versa.
Moreover, the MI is only a "static" measure of the amount of information shared by two time series, completely ignoring the direction of the information flow itself.
These complications can be avoided either by using suitable time shifts of the series X with respect to Y (the so-called M_shift), which allow us to study the variation of the MI (and, therefore, of the information flow) as a function of the displacement of one time series with respect to the other, or, following Schreiber's idea [2], by defining a new type of entropy that somehow "captures" the dynamics of the system based on the rate of change of (1.15).
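Equation (1.16) and its invariance under X ↔ Y can be checked numerically on small joint distributions. A minimal sketch with illustrative helper names:

```python
import math

def mutual_information(p_xy, base=2.0):
    """m_i = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x)*p(y)) ], as in (1.16).
    p_xy is a joint distribution given as a matrix (list of rows)."""
    p_x = [sum(row) for row in p_xy]          # marginal of X
    p_y = [sum(col) for col in zip(*p_xy)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(p_xy):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (p_x[i] * p_y[j]), base)
    return mi

# Independent joint distribution: m_i = 0.
assert abs(mutual_information([[0.25, 0.25], [0.25, 0.25]])) < 1e-12
# Perfectly dependent variables: m_i = 1 bit.
assert abs(mutual_information([[0.5, 0.0], [0.0, 0.5]]) - 1.0) < 1e-12
```

Transposing the joint matrix (i.e. swapping the roles of X and Y) leaves the value unchanged, which is exactly the symmetry that makes the MI blind to the direction of information flow.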

Transfer Entropy
Let us suppose the existence of two series of events in strict chronological order, X = {x₁, ..., xₙ} and Y = {y₁, ..., yₙ}, and suppose we want to find any correlation between the two sequences and the "future" element x_{n+1} of the first series. The amount of additional information that would serve to represent (or, equivalently, to predict) such a value is
H₁ = −Σ p(x_{n+1}, x_n, y_n) · log[p(x_{n+1}|x_n, y_n)]   (1.17)
If the observed value is independent of y_n, then:
H₂ = −Σ p(x_{n+1}, x_n, y_n) · log[p(x_{n+1}|x_n)]   (1.18)
While (1.17) is the entropy rate of the two systems taken together, (1.18) represents the entropy rate under the assumption that x_{n+1} is independent of y_n.
We define the transfer entropy (hereafter simply TE) as the difference between the two previous amounts: H₂ − H₁. In other words, the TE is configured as the difference between two different entropy rates: the first related to the amount of information necessary to deduce the element x_{n+1} on the basis of all its known preceding elements, the second related to the information necessary to deduce x_{n+1} according to both the sequences X and Y preceding it.
Equivalently, the TE can also be understood as a measure of how the uncertainty in the prediction of x n+1 on the basis of the only elements of X is reduced by the introduction of Y in the calculation.
Based on the above definition, the TE has a functional form of the type:
T_{Y→X} = Σ p(x_{n+1}, x_n, y_n) · log[ p(x_{n+1}|x_n, y_n) / p(x_{n+1}|x_n) ]   (1.19)
According to (1.19), the TE has its minimum value (zero) when the final state is independent of the second time series:
p(x_{n+1}|x_n) = p(x_{n+1}|x_n, y_n)   (1.20)
and may assume at most the unit value. It cannot, under any circumstances, be negative, since it can be expressed as a Kullback-Leibler divergence between the two conditional distributions.
The logarithm used in (1.19) is natural or base 10. If, however, one chooses base 2, the TE assumes a particular meaning: its unit becomes the bit, and the TE itself can be interpreted as the number of bits that must be extracted from the sets X and Y to encode the information carried by x_{n+1}.
The previous formulation, however, is not complete because, as is obvious, there is no symmetry under the exchange x ↔ y (or, as we will say from here on, under the exchange of the two channels, X and Y). We will have, therefore, to reconsider the TE so as to include both directions of transfer, obtaining the equivalent forms:
T_{Y→X} = Σ p(x_{n+1}, x_n, y_n) · log[ p(x_{n+1}|x_n, y_n) / p(x_{n+1}|x_n) ]   (1.21)
T_{X→Y} = Σ p(y_{n+1}, x_n, y_n) · log[ p(y_{n+1}|x_n, y_n) / p(y_{n+1}|y_n) ]   (1.22)
In practice, the value of T is a measure of how many digits of the series X can be recovered from elements of Y and vice versa. In this way, the TE is better suited than m_i and M_shift to determine the quantity and direction of the information flow, being, as Schreiber puts it, "a real directional and dynamical measurement of information transfer" [3].
The relations from (1.21) to (1.24) can be greatly refined by including information about the "temporal window" within which the TE is calculated (i.e. the time interval within which it is assumed that the elements x_i of the series are affecting the elements of the Y channel and vice versa, the so-called lag, δ), and the model order, i.e. the actual number of elements of the sets X and Y which are supposed to "influence" the element x_{n+1}.
This information can be inserted into the previous relations in the following way:
T_{Y→X} = Σ p(x_{n+1}, x_n^{(k)}, y_n^{(l)}) · log[ p(x_{n+1}|x_n^{(k)}, y_n^{(l)}) / p(x_{n+1}|x_n^{(k)}) ]   (1.25)
T_{X→Y} = Σ p(y_{n+1}, x_n^{(k)}, y_n^{(l)}) · log[ p(y_{n+1}|x_n^{(k)}, y_n^{(l)}) / p(y_{n+1}|y_n^{(l)}) ]   (1.26)
in which x_n^{(k)} and y_n^{(l)} represent, respectively, the k values preceding x_n and the l values preceding y_n, at regular intervals. The lag enters the previous relations when we choose a time distance between (x_n, y_n) and x_{n+1} different from unity.
If we consider the fact that, generally, k and l are set equal to each other (k = l = m), and that the m values of X and Y are equidistant from each other by the same amount τ (the "delay"), the total width of the time window on which each element of the sum is evaluated is δ + mτ. In this context, m is called the model order.
If in (1.25) and (1.26) we do not take the sum over the time windows, the resulting series of TE values describes the time behavior of the TE between channels X and Y.
As can be seen from simple inspection of (1.25) and (1.26), the focus of the entire calculation is the estimation of the probability functions (pf) p(x_{n+1}|x_n, y_n) and p(y_{n+1}|x_n, y_n).

Strategies for pf estimation
Several methods can be used for the estimation of the pf, each with its limitations and its complexity.
Here we will present a few, in order of increasing theoretical and computational complexity.

Method of variable-width binning
Let there be, for each channel X and Y, a fixed number m of events (m being, as always, the order of the model) preceding the element x_t, equidistant from each other by a quantity τ. The two series of events (simultaneous and temporally ordered) X_p = {x_{t−mτ}, ..., x_{t−τ}} and Y_p = {y_{t−mτ}, ..., y_{t−τ}} constitute the "past" of the event x_t. Let δ (the lag) be the temporal distance between x_t and the most recent elements of the series representing its past, x_{t−τ} and y_{t−τ}.
The first step in the estimation of the pf is to allocate the series of events X_p ⊗ Y_p ⊗ x_t (here ⊗ represents the vectorial concatenation, or cartesian product, of the three time series X_p, Y_p and x_t) in a (2m + 1)-dimensional space (called the "events space"), whose dimensions are precisely the elements of X_p, Y_p and x_t. To each combination of the 2m + 1 vector elements there corresponds a point in the events space, whose starting value is zero and which is incremented by 1 every time the given combination is repeated. As we move forward with the time window (δ + mτ samples wide) in which it is believed that the channel Y affects the channel X, the events space becomes populated more or less uniformly, depending on whether the channel Y affects X or not (figure 1.2 - B).
In case m equals 1, the events space is three-dimensional; otherwise, in case m ≥ 2, the events space is a hypervolume of 2m + 1 dimensions. Moreover, in case we want to consider only two channels, X and Y, the analysis is called bivariate; if the "past" of x_t is extended to more than two channels (say X, Y and Z), the analysis (the one we will actually apply to our electroencephalographic data) is called multivariate, and will be discussed in more detail at the end of this section.
At this point, the conditional probabilities appearing in (1.25) and (1.26) can be estimated in two different ways: by directly summing the probabilities found (whose magnitude is equal to the value the given point assumes with respect to the total number of non-null points of the whole space), or by passing to the probability space, which is constructed on the basis of the first and shares the same number of dimensions, but has a finite extension and unit volume, the probability being limited to values between 0 and 1 (figure 1.2 - C). Each point of this space represents the fraction of events presenting the probability p(X_p, Y_p, x_t) to occur. It is preferable, in addition, to reorder the events in this space so that the probability increases as one moves from left to right (method of ordinal sampling). The reason for this choice lies in the fact that the TE is well defined only if the pf presents no singular points, or better, if there are no points whose distribution is of Dirac-delta type, which, in fact, is highly likely to happen in the volume occupied by the points of the sample space.
An equivalent way to do this [4,5] consists in normalizing the initial data to fit the interval [0, 1], so that the events space is automatically finite and of unit volume. In this case the composed probabilities (1.25) and (1.26) can be derived by setting an interval (bin) ∆ common to all dimensions, and considering in turn the elementary volumes of base (X_p^i ± ∆, Y_p^k ± ∆) and unit height (i and k are indices that run along the space dimensions).
In this case, the composed probabilities p(x_t|X_p, Y_p) can be obtained by adding together, at different heights, the nonzero elements on the cutting plane identified by the pair (X_p, Y_p), normalizing, and letting the interval ∆ shrink to zero. In this way, different pfs are obtained as functions of X_p and Y_p (figure 1.2 - E). Similarly, the probability p(x_t, X_p) is obtained by marginalizing the dimension spanned by Y_p (i.e. summing over all the Y_p values), no longer working on volumes but on surfaces centered around X_p and 2∆ wide (figure 1.2 - D).
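The binning strategy just described can be sketched, in a deliberately simplified form (fixed bins on discrete symbols, model order m = 1, unit lag), as a direct plug-in estimate of (1.25); the function name and test signals below are illustrative, not the actual implementation used on the EEG data:

```python
from collections import Counter
import math
import random

def transfer_entropy(x, y):
    """Plug-in (histogram) estimate of TE(Y -> X) in bits, for sequences of
    discrete symbols, with model order m = 1 and unit lag:
    sum_n p(x_{n+1}, x_n, y_n) * log2[ p(x_{n+1}|x_n, y_n) / p(x_{n+1}|x_n) ]."""
    n = len(x) - 1
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{n+1}, x_n, y_n)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_n, y_n)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{n+1}, x_n)
    singles = Counter(x[:-1])                       # x_n
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_cond_both = c / pairs_xy[(x0, y0)]            # p(x_{n+1}|x_n, y_n)
        p_cond_self = pairs_xx[(x1, x0)] / singles[x0]  # p(x_{n+1}|x_n)
        te += p_joint * math.log2(p_cond_both / p_cond_self)
    return te

# Toy system: X copies Y with one step of delay, so information flows Y -> X.
random.seed(0)
y = [random.randint(0, 1) for _ in range(2000)]
x = [0] + y[:-1]
assert transfer_entropy(x, y) > transfer_entropy(y, x)
```

On this toy system TE(Y→X) comes out close to 1 bit, while TE(X→Y) stays near zero (up to the small positive bias typical of plug-in estimates), reproducing the asymmetry discussed above.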

Kernel Density Estimation
The Kernel Density Estimation method (KDE ) is used to estimate the pf by adding together individual distributions centered on each element of the series.
Considering, in fact, the events space of the previous case, we can imagine that in each volume ∆ wide, centered on the generic point {x̃_t, x̃_{t−τ}, ỹ_{t−τ}} of the (three-dimensional) sample space, there are P points whose distribution is described by a generic function K(x_t, x_{t−τ}, y_{t−τ}) (called the kernel of the model). K depends not only on the space coordinates, but also on a number of parameters, some of which have well-defined values, while others must be chosen ad hoc, in such a way as to make the shape of the distribution as close as possible to the real data. In addition, due to issues related to probability normalization, it has to show some peculiar features, such as a rapid decrease moving away from the distribution center¹, and must satisfy the usual kernel conditions of non-negativity, normalization to unity, and symmetry.
¹ With regard to this peculiarity, the concept of distance is crucial to the estimation of the pf, so the previously seen method of sorting points according to their increasing probability cannot be adopted.
The probability density is estimated [4] by normalizing the following expression:
p(x̃_t, x̃_{t−τ}, ỹ_{t−τ}) ∝ (1/P) Σ_j K((x̃_t − x_t^j)/h_{x_t}) · K((x̃_{t−τ} − x_{t−τ}^j)/h_{x_{t−τ}}) · K((ỹ_{t−τ} − y_{t−τ}^j)/h_{y_{t−τ}})   (1.28)
where j is the index identifying all points within the elementary volume and h_(.) is the characteristic width of the distribution along the axis indicated by the subscript. This width has a widely accepted rule-of-thumb value [6] equal to
h_(.) = 1.06 · σ̂_(.) · P^{−1/5}
where σ̂ is the standard deviation of the sample along the direction defined by the subscript of h. This expression for h can be effectively replaced by another [6] which considers, rather than the standard deviation, the interquartile range (IQR), i.e. the width of the range containing the middle half of the observed values, assuming that the sample has a normal distribution:
h_(.) = 0.79 · IQR_(.) · P^{−1/5}
The choice of the h value is of primary importance, because values that are too small can lead to pfs characterized by peaks or singular points, while values that are too high can lead to poorly differentiated distributions (figure 1.4).
Concerning K(···), there are indeed several functional forms that can be used [7] (e.g. uniform, triangular, Epanechnikov). The most widely used distribution for the kernel is the Gaussian distribution with all its variants (unimodal, bimodal, etc.), which for h_(.) tending to zero reduces to a Dirac delta distribution. It is important to stress one aspect of (1.28): although the pf was built as a product of different kernels, one for each axis of the events space, independent of one another, this does not mean that the variables they refer to are equally independent of each other; in the latter, fully general case the form of (1.28) would be of the type:
p(x̃) = (1/N) Σ_j K_D(x̃ − x_j)
where D is the number of dimensions of the events space, K_D is a fully D-dimensional kernel, and N is the number of points in the elementary volume.
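A one-dimensional sketch of the kernel estimate with a Gaussian kernel and the rule-of-thumb width h = 1.06 · σ̂ · n^{−1/5} discussed above (function and variable names are our own):

```python
import math
import random

def gaussian_kde_1d(sample):
    """One-dimensional Gaussian-kernel density estimate using the
    rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5)."""
    n = len(sample)
    mean = sum(sample) / n
    sigma = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    h = 1.06 * sigma * n ** (-0.2)

    def pdf(x):
        # Sum of Gaussian kernels centered on each sample point.
        norm = n * h * math.sqrt(2.0 * math.pi)
        return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm
    return pdf

# Sanity check on a standard normal sample: the estimate at the origin
# should be close to the true density 1/sqrt(2*pi) ~ 0.3989.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
pdf = gaussian_kde_1d(sample)
assert abs(pdf(0.0) - 0.3989) < 0.05
```

Shrinking h by hand makes the estimate spiky (peaks on individual samples), while inflating it flattens the distribution, which is exactly the trade-off shown in figure 1.4.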
The remaining probabilities featuring in (1.28) can be calculated starting from p(x̃_t, x̃_{t−τ}, ỹ_{t−τ}) by marginalizing, each time, the distribution.
The three-dimensional space identified by these points, and by the actual values of the future of one of them, can be divided into a certain number of cubic sub-spaces of different sizes (generally we start with 8 cubes, whose vertices are identified by the points on the axes having probability equal to 0, 0.5 and 1).

Darbellay-Vajda partitive algorithm
Each of these cubes will contain a certain number of points, on which the χ² statistic can be calculated to verify whether or not the points are uniformly distributed in space [4]:
s_χ² = Σ_{i=1}^{8} (M_i − μ_M)² / μ_M
in which {M₁, ..., M₈} represent the numbers of the series' points in each cube and μ_M is the total number of points in the probability space divided by 8 (i.e. by the number of cubes). If the s_χ² test value is numerically greater than the χ²_{95%}(7) value (5% significance level with 7 degrees of freedom), then the null hypothesis can be rejected, and for each of the eight cubes we proceed to a further division into eight subcubes in the same way (figure 1.6).
Proceeding recursively, we will reach a point where the null hypothesis can no longer be rejected: the last eight cubes will then be considered as a single entity (i.e. the last partition has no validity) and we end up with a certain number L of partitions of the probability space, from which the partitions containing no points have been removed.
Relation (1.25) can, therefore, be rewritten in terms of these partitions, with P representing the total number of points in the space, n_k the number of points in the k-th partition, and n the number of points of the whole space whose values, independently of the other dimensions, lie in the range [1, 5] along the dimension of v_{i−1}.

Comparison of the algorithms
It is possible to compare the results from different algorithms outlined above.
Using a series of simulated data [4], it has been shown that all three methods effectively identify the appropriate time interval elapsing between an event in channel X and the one it "caused" in channel Y (the lag, in this case equal to 2, as can be seen in figure 1.7); also, the calculated TE level is comparable in the first and in the third case, while with the KDE algorithm there is a slight decrease (figure 1.7). This leads us, in the choice of the algorithm to use, to lean towards the first model, both for issues related to computation time and because it provides higher mean values of TE than the other two in the time intervals in which there is an actual causal connection between the two channels.

Multivariate TE
There are cases in which the time series constituting a complex system are more than two, as for the information flow between various areas of the cerebral cortex. More importantly, these series can interact with one another, so that the behavior of each of them is determined not only by that of one other series, but by the net, simultaneous behavior of all the others.
In this case the formulation of the TE outlined here is insufficient, since it does not take this effect into account, but only the relationship between two of the series, as if the surrounding universe did not exist (the bivariate case).
To overcome this deficiency, the theory must be reformulated in the multivariate way, introducing corrections that provide the net amount of information exchanged under the action that the other occurring time series can have on the one under investigation [8].
Fortunately, the TE formalism outlined here lends itself well to the multivariate extension (MVAR), and requires only a redefinition of relation (1.19).
Let, therefore, X, Y and Z be three time series (three channels) that can mutually interact. If we denote, for brevity of notation, by X⁻, Y⁻ and Z⁻ their pasts (vectors whose size depends, as we have already seen, on the model order), by x the current value of the series X, and by ⊕ the concatenation operator, then it can be shown, with a procedure identical to that already seen, that [9]:
T_{Y→X|Z} = H(x | X⁻ ⊕ Z⁻) − H(x | X⁻ ⊕ Y⁻ ⊕ Z⁻)
with H(·|·) representing the conditional entropy. All other possible combinations of the three (or more) channels can be obtained and calculated in the same manner seen previously, the partitive algorithms simply acquiring a few more dimensions.
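The multivariate correction can be sketched, for discrete symbols and model order 1, as a difference of plug-in conditional entropies. The common-driver test below (all names are illustrative) shows why it matters: a bivariate analysis would report a spurious flow Y → X, while conditioning on Z removes it:

```python
from collections import Counter
import math
import random

def cond_entropy(pairs):
    """Plug-in estimate of H(A|B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marg = Counter(b for _, b in pairs)
    return -sum((c / n) * math.log2(c / marg[b]) for (a, b), c in joint.items())

def conditional_te(x, y, z):
    """TE(Y -> X | Z) = H(x | X-, Z-) - H(x | X-, Y-, Z-), model order 1."""
    without_y = [(x[n + 1], (x[n], z[n])) for n in range(len(x) - 1)]
    with_y = [(x[n + 1], (x[n], y[n], z[n])) for n in range(len(x) - 1)]
    return cond_entropy(without_y) - cond_entropy(with_y)

# Common-driver scenario: Z drives X with one step of delay, while Y merely
# mirrors Z.  Given Z, the channel Y adds no information about X at all.
random.seed(2)
z = [random.randint(0, 1) for _ in range(2000)]
x = [0] + z[:-1]
y = z[:]
assert abs(conditional_te(x, y, z)) < 1e-9
```

Here both conditional entropies vanish because z_n determines x_{n+1} exactly, so the net transfer from Y is zero, as the multivariate formulation requires.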

Cross-Correlation
As we have seen so far, the calculation of the TE is intimately linked to the choice of the lag parameter, i.e. the time elapsed between the current value of the time series under examination and the past of the time series that is supposed to influence it.
Obviously, this value may make the study more or less valid, so its choice must be made using appropriate tools.
The best way to do so is to perform the calculation of the TE while varying the lag value each time, so as to obtain the temporal variation of the transferred information and, once the maximum has been identified, the moments in time at which the exchanged information is largest. Moreover, this is the best way to study the temporal dynamics of information, as it makes evident the temporal changes in the interaction between the various time series under investigation. However, this requires a long computation time, especially because the temporal extent of the lag is not known a priori, and this inevitably leads to an unnecessary waste of calculation time.
One way to impose constraints on the variability of the lag is to calculate the Cross-Correlation (CC) between the two time series, defined as the covariance of the two temporally ordered and normalized series X and Y. Formally, it can be considered as the normalized convolution of the two series, which in the case of discretized data can be expressed as:
CC_{XY}(d) = Σ_k (x_k − x̄)(y_{k−d} − ȳ) / [ Σ_k (x_k − x̄)² · Σ_k (y_k − ȳ)² ]^{1/2}
where the sum runs over the k bins (common to X and Y) into which the two time series were decomposed, and d is the displacement between them. In the case of biological series, k can be the
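A minimal sketch of using the CC to bound the lag search (function and variable names are our own): the displacement at which the normalized cross-correlation peaks is a natural candidate around which to restrict the far more expensive TE computation:

```python
import random

def cross_correlation(x, y, lag):
    """Normalized correlation between x[t] and y[t - lag], for lag >= 0:
    how well the past of Y, delayed by `lag` samples, matches the present of X."""
    xs = x[lag:]
    ys = y[:len(y) - lag]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    sx = (sum((a - mx) ** 2 for a in xs) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in ys) / n) ** 0.5
    return cov / (sx * sy)

# X repeats Y with a 3-sample delay: the CC peaks at lag 3, which bounds
# the interval of lags worth scanning with the TE.
random.seed(3)
y = [random.gauss(0.0, 1.0) for _ in range(1000)]
x = [0.0, 0.0, 0.0] + y[:-3]
best_lag = max(range(10), key=lambda lag: cross_correlation(x, y, lag))
assert best_lag == 3
```

Unlike the TE, the CC is symmetric and purely linear, so it can only suggest where to look for information transfer, not measure or orient it.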

Synchronization Entropy
A second type of entropy that can be evaluated between two channels X and Y is the Synchronization Entropy (SE), defined in a similar way to Shannon's information entropy (1.12), but referring to the probability that the two time series are synchronized.
It is important to note immediately that synchronization between two time series is not synonymous with their correlation (or coherence). However, it can be shown [10] that if two systems are synchronized, then they are certainly also correlated (the condition is necessary, but not sufficient).
In this regard, the most appropriate way [11,12] to define the phase of a time-varying signal (or time series) s(t) is to introduce a generalization of the signal itself in a complex space, imagining that s(t) is only the real part of a complex signal, i.e. one composed of a real part and a purely imaginary one. We denote by ζ(t) this complex function of a real variable (the time, t), which can be rewritten as [13,14]
ζ(t) = s(t) + i s_H(t) = A(t) e^{iφ(t)} ,    (1.36)
with s_H(t) representing the Hilbert transform of the original signal s(t):
s_H(t) = (1/π) PV ∫ s(τ) / (t − τ) dτ ,
and PV indicating the principal value of the integral. Once (1.36) has been computed, the signal amplitude A(t) and its phase φ(t) are fixed instant by instant. We note explicitly that the Hilbert transform does not contain additional parameters to be calculated: it can be implemented as an ideal filter whose amplitude response is 1 and whose phase response is equal to π/2 over the entire frequency band.
It can be demonstrated [15] that (1.36) has physical meaning only when the initial signal s(t) consists of a narrow band of frequencies: in this case the amplitude A(t) represents the envelope of the signal s(t), and the instantaneous frequency is the frequency presenting a maximum in the spectrum of s(t).
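A short sketch of this construction (using scipy's `hilbert`, which implements exactly such an ideal filter; the signal parameters are purely illustrative): the analytic signal yields the envelope and the instantaneous phase, and the phase derivative recovers the instantaneous frequency.

```python
import numpy as np
from scipy.signal import hilbert

# Narrow-band test signal: a 10 Hz sine sampled at 1 kHz.
fs, f0 = 1000.0, 10.0
t = np.arange(0, 2.0, 1 / fs)
s = np.sin(2 * np.pi * f0 * t)

zeta = hilbert(s)                 # analytic signal zeta(t) = s(t) + i s_H(t)
A = np.abs(zeta)                  # instantaneous amplitude (envelope)
phi = np.unwrap(np.angle(zeta))   # instantaneous phase

# The derivative of the phase recovers the instantaneous frequency.
f_inst = np.diff(phi) / (2 * np.pi) * fs
```

For a pure narrow-band tone the envelope is flat and the instantaneous frequency sits at the spectral maximum, in agreement with the statement above.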
Considering two different time-varying signals, s_1(t) and s_2(t), we can define their generalized phase difference in the following way:
φ_{m,n}(t) = m φ_1(t) − n φ_2(t) ,
where m and n are two appropriate integer weights; as is evident, φ_{m,n} is defined up to an additive factor 2π. If the two signals, within a certain time interval, are synchronized, the distribution of the phase difference will be peaked on a certain value; otherwise it will present itself as uniform. Similarly, if we report the values of the two phases in a two-dimensional diagram, synchronization shows up as a concentration of the points along a preferred line. In the simplest (1:1) case it is possible to consider both m and n as equal to 1.
As already seen for the CC, the SE too is invariant under the exchange of the two series X and Y, so the corresponding functional map, obtained by associating to each pair of channels its SE value, will be symmetrical (figure 1.11).
The SE can be calculated, as previously mentioned, similarly to (1.12), namely:
SE = − Σ_{i=1}^{N} p(i) ln p(i) ,    (1.39)
in which N represents the number of intervals into which the probability is subdivided, and p(i) is the probability of occurrence of the i-th event (the phase difference, in the specific case). However, differently from what happened in the calculation of the TE, in the present case it is possible to apply a correction to the previous relation that cancels the error due to the quantization interval into which the probability is subdivided [16], in which p̂(i) represents the frequency with which the i-th combination is present and m is the number of intervals among the N containing at least one point.
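A possible numerical sketch of the SE of the (1,1) phase difference (our own construction, not the author's code; the bin count and the test signals are illustrative): a sharply peaked phase-difference distribution yields low entropy, a uniform one yields entropy close to ln N.

```python
import numpy as np
from scipy.signal import hilbert

def sync_entropy(s1, s2, n_bins=32):
    """Shannon entropy of the (1,1) phase-difference distribution.
    Low entropy = sharp peak = strong synchronization."""
    phi1 = np.angle(hilbert(s1))
    phi2 = np.angle(hilbert(s2))
    dphi = np.mod(phi1 - phi2, 2 * np.pi)     # phase difference in [0, 2*pi)
    p, _ = np.histogram(dphi, bins=n_bins, range=(0, 2 * np.pi))
    p = p[p > 0] / len(dphi)
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(1)
t = np.arange(0, 10, 0.001)
base = np.sin(2 * np.pi * 8 * t)
locked = np.sin(2 * np.pi * 8 * t + 0.3)   # same frequency, fixed phase shift
noise = rng.standard_normal(len(t))

se_locked = sync_entropy(base, locked)     # near zero: constant phase difference
se_noise = sync_entropy(base, noise)       # near ln(32): uniform phase difference
```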

Chapter 2
Time-Frequency analysis of signals

Introduction
For a time series, the Fourier transform (or series) is a useful tool for very different purposes. One of these, and probably the main one, lies in the fact that its modulus describes the contribution given to the time series by the various frequencies into which it can be decomposed, while its squared modulus describes the contribution of the various frequencies to the total energy. Very often, in fact, it is of primary importance to know whether certain frequencies in a signal are more active than others, as their presence, absence or different relative amplitude can be related, for example, to pathological changes of the electroencephalographic rhythms of patients suffering from a given disease.

Fourier Analysis
Given a signal f(t) ∈ L²(R) in the time domain, the Fourier Transform (FT) gives a new representation of the signal in the domain of the new variable ω. The mathematical expression identifying the FT is the following:
f̂(ω) = ∫ f(t) e^{−iωt} dt .
If the variable ω is considered as an angular frequency and we require that ω = 2πν, then we get:
f̂(ν) = ∫ f(t) e^{−2πiνt} dt .    (2.2)
The function generated by (2.2) depends only on the variable ν, which can be identified as a frequency. The usefulness of such a transform resides in the fact that its squared modulus returns, for each frequency, the amplitude of the sinusoidal component constituting the signal, if the signal is imagined, of course, as the sum of infinitely many periodic components.
In turn, the individual components contribute to the definition of the signal's total energy. If we remember, in fact, that for a time-limited signal the total carried energy is E = ∫ |f(t)|² dt, the Parseval relation ensures that
∫ |f(t)|² dt = ∫ |f̂(ν)|² dν ;
S_f(ν) = |f̂(ν)|² is called the energy spectrum of the signal f(t). A similar relation is obtained for the signal power. Let us consider, for example, two signals built from the same three sinusoidal components of frequencies ω, 2ω and 3ω, with amplitudes A_i, but arranged differently in time. On the basis of what has been seen, the Fourier analysis would return, for both signals, the same spectrum: a series of three frequencies, integer multiples of ω, each of amplitude A_i, even though the two signals are deeply different. And this applies to any series of frequencies (see figure 2.1).
This phenomenon must not be surprising: it is in accordance with the Heisenberg uncertainty principle between time and frequency. In many applications, however, it is important to study the dynamics of the time series, highlighting, for example, the occurrence of oscillations or transients following a certain event such as, in the case of an electroencephalographic recording, a stimulation.
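The ambiguity can be made concrete with a small numpy experiment (our own illustration, not from the text): time-reversing a signal changes it radically in time while leaving the magnitude of its Fourier transform exactly unchanged.

```python
import numpy as np

fs = 1000
t = np.arange(0, 1, 1 / fs)
# A 10 Hz tone in the first half followed by a 40 Hz tone in the second half.
s1 = np.where(t < 0.5, np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 40 * t))
s2 = s1[::-1]                     # the same events, in reversed temporal order

spec1 = np.abs(np.fft.rfft(s1))
spec2 = np.abs(np.fft.rfft(s2))   # identical amplitude spectrum
```

The two amplitude spectra coincide to machine precision, although the temporal order of the events is opposite: the spectrum alone cannot tell when each frequency occurred.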

Wavelet Analysis
The SWFT, however, is rather inflexible, because the signal portions are all extracted with the same length, regardless of the frequency content of the signal. Furthermore, the ability of this method to resolve the signal components at the right moment (or at least compatibly with the uncertainty principle) depends on the window width which, as it approaches zero, can generate undesirable "edge effects"; it can even require the introduction of "artificial" signal outside the right and left edges of the window, and this would fit into the window non-real frequencies (in truth, as we shall see, the wavelet transform suffers from the same problem, but in that case there exists a quantitative parameter taking this phenomenon into account).
A more elaborate version of the SWFT, better suited to the problem, is obtained by reconsidering the form of the window function w(t), its magnitude and its temporal extension.
The basic idea of wavelet analysis (from "small waves") is to use rectangles of different proportions to localize components in the time-frequency plane: more precisely, the localization in frequency decreases logarithmically with increasing frequency, while the temporal localization gradually becomes sharper (figure 2.2). Differently from Fourier analysis, which passes from a purely temporal representation of the signal to a purely spectral one, the time-frequency localization is obtained by replacing, in (2.7), the w(t) function with the so-called mother wavelet function, ψ_{a,b}(t), where t is as always the time and a and b are two characteristic free parameters, the first called scaling parameter and the second translation factor. These two parameters are responsible, respectively, for stretching or shrinking the wavelet (and, consequently, varying the frequency of the wave packet within it) and for sliding the wave along the time series, in agreement with the definition of the convolution product.
From a geometrical point of view, the mother wavelet is a wave packet localized in time and modulated by a Gaussian, whose width is related to the frequency of the wave packet in such a way that, beneath the envelope, a well-defined number of oscillations can take place. In this context, the parameters a and b are particularly important:
• large values of the scale parameter a are equivalent to lengthening the wavelet and its support, consequently decreasing the frequency of the wavelet itself, the reverse being true for low scaling parameters;
• small values of the scale parameter a must correspondingly match small temporal supports, and therefore high frequencies.
Based on these considerations, the Wavelet Transform (WT) turns out to be, in analogy with (2.7):
ũ_{a,b} = ∫ u(t) ψ*_{a,b}(t) dt ,    (2.8)
with ψ_{a,b}(t) = |a|^{−1/2} ψ((t − b)/a). As a result of the convolution of the two time-dependent functions, the matrix ũ_{a,b} does not depend on time, and each element is a coefficient quantifying the similarity between the original signal and the wavelet function at a specific scale a (equivalent to a certain frequency) and at a specific time shift b.
In the passage from the continuous to the discrete case, typical of computer procedures, one can choose to tie the two parameters together using the integer j in the following way: a = 2^{−j} and b = ka = k 2^{−j}, with j, k ∈ Z.
On this basis, the ψ(2^j t − k) components correspond, in the time-frequency plane, to rectangles of variable size 2^{−j} × 2^j (figure 2.2).
The main advantage of using the wavelet transform is that it provides a multiscale decomposition of the signal: for each scale j, the signal is decomposed into elementary components whose frequency content increases with the scale. In general, the multiscale decomposition of a signal consists of a coarse mean value of the signal at a given scale (low-frequency content) plus increasingly fine details calculated at more accurate scales (high-frequency content). If the signal varies slowly, the details are not important to reconstruct the signal, so the multiscale representation provides a natural tool for compression: it is sufficient to discard details below a certain threshold and transmit only the most significant ones. The same technique can be used to reduce noise. On the other hand, the most significant details provided by the wavelet analysis correspond to the areas where the signal has large variations, for example in correspondence of a peak: this is why wavelet analysis is successfully applied to signals from which information about the "geometrical" structures of the data needs to be extracted.
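A minimal, self-contained sketch of such a multiscale decomposition (a hand-rolled wavelet transform built on a Mexican-hat packet; the function names, scales and the test signal are our own illustration): small scales isolate a sharp transient, large scales respond to the slow oscillation.

```python
import numpy as np

def mexican_hat(points, a):
    """Mexican-hat wavelet of scale a, sampled on `points` points."""
    t = np.arange(points) - (points - 1) / 2
    norm = 2 / (np.sqrt(3 * a) * np.pi ** 0.25)
    return norm * (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def wavelet_transform(u, scales, width=10):
    """One row of coefficients per scale a, one column per translation b."""
    out = np.empty((len(scales), len(u)))
    for i, a in enumerate(scales):
        w = mexican_hat(int(width * a) + 1, a)
        out[i] = np.convolve(u, w, mode="same")
    return out

# Slow 5 Hz oscillation plus a sharp spike at sample 500: the spike
# dominates the smallest scale, the oscillation the larger ones.
t = np.linspace(0, 1, 1000)
u = np.sin(2 * np.pi * 5 * t)
u[500] += 5.0
coeffs = wavelet_transform(u, scales=[2, 4, 8, 16, 32])
```

The row at the smallest scale peaks at the location of the transient, exactly the "significant detail" the text describes.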
Finally, the mother wavelets (or "bases") can be used for the representation and compression of integro-differential operators: this allows building efficient numerical methods for the solution of integral or partial differential equations. Several of these applications are discussed in [17], where a detailed bibliography is also given.

Mother Wavelet
The mother wavelet function must have specific characteristics that define it uniquely. As we said, it must be a function with compact support, in accordance with the very principle of localization of the component frequencies.
Second, it must meet a number of specific mathematical requirements, such as presenting a Fourier transform ψ̂(ω) leading to a finite c_ψ parameter:
c_ψ = ∫ |ψ̂(ω)|² / |ω| dω < ∞ .
For this condition to hold it is sufficient that the wavelet has zero mean, ∫ ψ(t) dt = 0. This requirement arises from the fact that, as in the case of the Fourier transform, for the WT too it must be possible to trace back the energy spectrum of the signal, but in a localized way compared to the FT, which represents, in some way, the time integral.
Considering, in fact, both the transform (2.8) and its inverse, we can first observe that there is no information loss in the passage from u(t) to ũ_{a,b} and vice versa, and that a Parseval-type relationship holds, so that the energy of the signal can be recovered from the wavelet coefficients. Another feature that the mother wavelet can possess (although it is not strictly necessary) is that the higher moments (at least the second, as we are going to see in the next section) are all zero:
∫_R t^n ψ(t) dt = 0 ,    (2.13)
or that, at least, the first two moments are zero: this aspect is connected to the better resolution that a wavelet presents at high frequencies as the number of vanishing moments increases (see below). Furthermore, the condition |ψ̂(ω)| ≤ C |ω|^p must be fulfilled, with C > 0 and p ∈ N. Moreover, a feature that the mother wavelet should possibly have is to be "localized" both in the frequency domain and in time. Before we can browse the most commonly used mother wavelets, it is imperative to introduce the concept of cone of influence of a mother wavelet.
As we have seen, the convolution between two signals requires the mother wavelet to be translated, with all its support (say, [−S, S]), along the signal to be analyzed. However, at the initial and final edges of the signal, the wavelet support can easily exceed (and, as a matter of fact, does exceed) that of the signal, thus causing unwanted edge effects. These effects can be treated analytically by extending the signal beyond its support, for example by assuming the periodicity of the signal outside of it, or by mirroring it. These corrections, however, do not eliminate the artifacts that can occur, but their weight can be evaluated by considering the cone of influence of the transform.
Let us imagine, therefore, considering a generic point τ of the time series. The cone of influence of this point is defined, in the time-frequency plane, as the set of all points of the plane such that τ is contained in the support of the mother wavelet ψ_{a,b}(t) = a^{−1/2} ψ((t − b)/a). The latter being equal to [b − aS, b + aS], the cone of influence of τ is the set of pairs (a, b) such that |τ − b| ≤ aS; consequently, if b is located in the cone of influence of τ, then the wavelet coefficient ũ_{a,b} is affected by the value of the series at τ. Let us now consider the mother wavelets with which, generally, the analysis is performed.

Haar Mother Wavelet
The Haar wavelet is the simplest mother wavelet that it is possible to use. It is defined as follows:
ψ(t) = 1 for 0 ≤ t < 1/2 ,   ψ(t) = −1 for 1/2 ≤ t < 1 ,   ψ(t) = 0 otherwise.
Moreover, it is easy to verify that only the first moment of (2.13) is zero; comparing it with wavelets having more vanishing moments, the greater detail of the latter at high frequencies is evident.

Mexican Hat
It is possible to demonstrate that any p-th order derivative of a Gaussian function has the characteristics, previously seen, required of a mother wavelet.
In addition, it can be shown that in this case all the moments up to p-th vanish.
Starting from this principle, we can construct an entire family of mother wavelets whose elements are characterized by a growing derivative order, each having higher resolution at higher frequencies. If we observe figure 2.5, in fact, we can notice how, in passing from the first to the second derivative order, the resolution at high frequencies for the time series shown at the top (common to both analyses) increases (area circled in red).
More in detail, it is evident that the wavelets of first and second order are odd and even respectively: this is reflected in the fact that the first order is more sensitive to mere increments or decrements of the signal, while the second, more accurate, better distinguishes the speed with which the signal varies over time. For these reasons, the second-order derivative is preferred to the first (and to any succeeding order, both even and odd).
So, starting from the Gaussian function, with σ = a²/2 the corresponding mother wavelet is
ψ(t) ∝ (1 − t²/σ²) e^{−t²/(2σ²)} .
Because of its shape, such a wavelet is called Mexican Hat (or "sombrero", see figure 2.5, bottom). As is evident from the previous relation, it can be considered as the difference between two Gaussian filters of different scale, divided by the scale difference itself.
As a consequence of the above-mentioned Parseval relation, the localized spectral energy density for a Mexican Hat is obtained from the squared modulus of the corresponding wavelet coefficients. A further step forward for this class of wavelets consists in making the functions ψ_2 complex, for example using the Hilbert transform. However, when it becomes necessary to switch to a complex formulation, it is preferable to rely on other mother wavelet families.

Morlet Wavelet
The complex Morlet wavelet is one of the most used for the time-frequency analysis of signals. It consists of a wave train of central frequency z_0 modulated by a Gaussian whose width is z_0/π:
ψ(t) = c_ψ ( e^{i z_0 t} − e^{−z_0²/2} ) e^{−t²/2} .
The factor e^{−z_0²/2} is called the "correction factor", as it is used to correct the non-zero mean of the complex sine wave. c_ψ is the normalization coefficient, which is not uniquely determined but related to the value of z_0 which, in addition to defining the central frequency, also controls the number of oscillations within the packet. The choice usually accepted for z_0 is 5 or 7, as for these values (larger than 5) the correction factor is very small and can be approximated to zero. In the following table, some pairs (z_0, c_ψ) used in the literature are shown. Similarly to the previous case, the power spectral density of the signal can be expressed by means of the Parseval relation, with ω = 1/a and ũ_M representing the wavelet transform of the signal. Since this is a complex wavelet, the transform will be composed of a real part and an imaginary part: the first provides information on the amplitude variation of the frequencies that make up the signal, the second describes the phase variation of the components themselves.
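A sketch of the Morlet packet and a numerical check of its two defining properties, the (corrected) near-zero mean and the unit energy (the normalization c_ψ = π^{−1/4} used here is one common choice; other (z_0, c_ψ) pairs appear in the literature):

```python
import numpy as np

def morlet(t, z0=5.0):
    """Complex Morlet wavelet: a plane wave of central frequency z0 under a
    Gaussian envelope; exp(-z0**2/2) corrects the non-zero mean."""
    correction = np.exp(-z0 ** 2 / 2)
    c = np.pi ** -0.25                   # one common normalization choice
    return c * (np.exp(1j * z0 * t) - correction) * np.exp(-t ** 2 / 2)

t = np.linspace(-4, 4, 8001)
psi = morlet(t)
dt = t[1] - t[0]
mean = np.sum(psi) * dt                  # admissibility: (almost) zero mean
energy = np.sum(np.abs(psi) ** 2) * dt   # approximately unit energy
```

With z_0 = 5 the correction factor is about 4·10⁻⁶, which is why it is usually neglected in practice.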

Comparison of the mother wavelet
The choice of the mother wavelet to be used to perform the transform depends on the specific purpose of the analysis and on the characteristics of the single wavelet.
A first comparison between them can be carried out considering the Fourier spectrum of the wave packet. The Haar mother wavelet, for example, presents a frequency spectrum that is not limited, as a consequence of its discontinuities in the time domain. A final comparison can take place according to the shape of the wavelet.
The Mexican hat has little periodicity (a central maximum and two minima at the sides of the support) and consequently a more marked tendency to highlight local maxima and minima when these are not in rapid succession: as already said, in fact, the transform coefficients represent the degree of similarity between the mother wavelet and the fragment of analyzed signal. The Morlet wavelet, on the contrary, presents a strong periodicity and appears to be more suited to highlighting fast sequences of maxima and minima, at the cost of a lower temporal resolution.

Introduction
As we saw earlier, it is possible to assess whether and how two or more time series share information or exhibit similarities of different nature. In the section dedicated to the TE, for example, we understood how to study, starting from the statistical data, the information flowing from a series X to a series Y recorded simultaneously. We are now going to see whether we can not only study what happened between two time series, but go further and try to predict the trend of a time series on the basis of its own previous behavior and of what happens in one or more contemporary time series.
The first attempt of this kind is due to Clive Granger [24], who in a pioneering article of 1969, later rewarded with the Nobel Prize for Economics, proposed a vector autoregressive model (VAR) to study the mutual influences between financial markets. Although it was born in the economic sphere, the model of Granger Causality (or G-causality, GC) can also be extended to other areas, including the dynamics of nonlinear systems [25,26] and neuroscience [27], helping us to understand, for example, the causal relationships between heart rate and arterial blood pressure or between heart rate and respiratory rate [28], and so on.
In recent times, the development of computer technology and the increase in computational power have made it possible to extend the scope of the GC to the nonlinear study of complex systems consisting of a large number of time series, such as recordings of electrical potentials on the scalp, thus opening the door to the numerical study of information dynamics in the cerebral cortex [29].

Granger Causality: the linear model
To fully understand the vector autoregressive model proposed by Granger [24], it is necessary to introduce from the beginning the formalism that we will use from now on.
Let us denote by {x̄_i} and {ȳ_i}, with i = 1, . . . , N, two time series of data simultaneously measured, with the same number N of samples. From here on we will assume that, for our model to be valid, the two time series are stationary: this means that we deal with signals of constant mean and variance, and that the covariance of any pair of segments belonging to the signal depends only on their relative distance in the signal itself.
Let us now consider the integer m, which will be called, as for the TE, the order of the model, and the integer k, which can take values from 1 to M = N − m. We will denote by x_k the (k + m)-th element of the initial series, x̄_{k+m}. Since m is related to the "width" of the window considered to calculate the causality, x_k is the k-th element after the considered time window. The same holds for the second series. Finally, we define the vectors X_k = (x̄_{k+m−1}, ..., x̄_k) and Y_k = (ȳ_{k+m−1}, ..., ȳ_k) as the "past" of the elements x_k and y_k respectively. Since k ranges from 1 to M, depending on the model order we can consider M realizations of the stochastic variables (x, y, X, Y). The VAR model provides, at this point, the introduction of the following linear system:
x_k = W_{11} · X_k + W_{12} · Y_k ,    y_k = W_{21} · X_k + W_{22} · Y_k ,    (3.1)
where {W} is a set of four real m-dimensional vectors whose values will be derived from the actual time series data.
The system can be solved with respect to the set {W} by means of least-squares techniques [30]: the solution is expressed through a block matrix Ã, whose blocks Σ collect the covariances between the past vectors X_k and Y_k, and through vectors T whose elements are the covariances between the past vectors and the targets x_k and y_k; all these elements are obtained from the actual elements of the series through inner products ( , ) of the corresponding lagged segments.
As one can see, the dimensionality of the matrices and vectors in question increases with the order of the model, and with the latter also increases the accuracy in the estimation of the error (this is an autoregressive model, so the error must exist and be different from zero) and its absolute value. If we denote by ε_xy and ε_yx such errors, their form is derived in [30], where Var(·) represents the variance operator. This model assumes, of course, that both sets X_k and Y_k contribute to "cause", within a given range, the trend of the variables x_k and y_k. If we neglect for a moment this cross-dependence, then the model, simply called autoregressive (AR), would provide
x_k = V_1 · X_k ,    y_k = V_2 · Y_k .
In this case, the least-squares method returns the corresponding coefficients, and the estimated variances of x − V_1 · X and y − V_2 · Y, which we denote by ε_x and ε_y, represent the prediction errors of x_k and y_k on the basis of the knowledge of their own past only. If the prediction error ε_xy is found to be smaller than ε_x, then we can say that considering both series helps to predict the future of x_k better than considering only its past. The same holds for the pair ε_yx and ε_y.
In this case, it is said that the signal {ȳ_i} has a causal influence on, or Granger-causes, the set {x̄_i} at order m.
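The comparison between the AR and VAR prediction errors can be sketched with a few lines of numpy least squares (a bivariate toy system of our own; following the later multivariate treatment, the ratio rather than the difference of the error variances is reported):

```python
import numpy as np

def granger_linear(x, y, m=2):
    """Linear G-causality of y on x at order m: variance of the AR
    prediction error (past of x only) over the VAR one (past of x and y).
    Values > 1 mean the past of y helps to predict x."""
    N = len(x)
    Xp = np.column_stack([x[j:N - m + j] for j in range(m)])   # past of x
    Yp = np.column_stack([y[j:N - m + j] for j in range(m)])   # past of y
    target = x[m:]
    w_ar, *_ = np.linalg.lstsq(Xp, target, rcond=None)          # AR model
    e_ar = target - Xp @ w_ar
    Z = np.hstack([Xp, Yp])                                     # VAR model
    w_var, *_ = np.linalg.lstsq(Z, target, rcond=None)
    e_var = target - Z @ w_var
    return np.var(e_ar) / np.var(e_var)

# Toy system in which y drives x with one sample of delay.
rng = np.random.default_rng(2)
y = rng.standard_normal(5000)
x = 0.8 * np.roll(y, 1) + 0.2 * rng.standard_normal(5000)
gc_y_to_x = granger_linear(x, y)   # large: y G-causes x
gc_x_to_y = granger_linear(y, x)   # close to 1: x does not G-cause y
```

The asymmetry of the two values is exactly what distinguishes G-causality from symmetric measures such as correlation.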
One way to quantify this causality is to compare the prediction errors: we can introduce the two parameters c_1 = ε_x − ε_xy and c_2 = ε_y − ε_yx, and define from them a directionality index D. This imprecision is overcome, as we shall see in the following paragraphs, by reconsidering the functional form of D^(l), which is also the cue for important considerations.
For sufficiently long time series (i.e., for N sufficiently large, as for example the EEGs we are going to deal with, spanning from a few thousand to about 10^5 samples), and according to the definition of causality, the following two properties hold:
• if Y is not correlated with X and x, then ε_x = ε_xy;
• if X is not correlated with Y and y, then ε_y = ε_yx.
Considering only the first of the two properties (for the second, the symmetrical argument holds), the non-correlation results mathematically in the fact that the operator Ã^{−1} is diagonal (Σ_xy = Σ_yx = 0) and, at the same time, the vector T_12, which somehow blends the causal dependencies, is identically zero. As a result, the VAR and AR models coincide.
According to the definition, the possibility that the indicator D^(l) proves suitable to describe a linear-type causality lies in the verification of the above-mentioned properties. However, it may be necessary to go beyond the assumption of linearity and consider possible nonlinear causal links between two temporally ordered series, since under these conditions the higher-order corrections may become more important than in the linear case, in which they are confined to small corrections.
In this case the two previously considered properties must be restated [31] in the following way:
• if Y is statistically independent of X and x, then ε_x = ε_xy;
• if X is statistically independent of Y and y, then ε_y = ε_yx.
Since recent studies [32] have tried to bring nonlinear causality into the field of the linear one (as occurs with the local approximation of a curve by a straight-line segment in Euclidean spaces), the best practicable choice is to consider sets of functions that, characterizing the nonlinearity of the causality, can globally fulfill the properties just enunciated.

Granger Causality: nonlinear model
To characterize the nonlinearity in the Granger causality we must reconsider the initial system (3.1), introducing two generic nonlinear vector functions, each of n components and m variables, Ψ = (ψ_1, ..., ψ_n) and Φ = (φ_1, ..., φ_n) [30]:
x_k = Ω_{11} · Ψ(X_k) + Ω_{12} · Φ(Y_k) ,    y_k = Ω_{21} · Ψ(X_k) + Ω_{22} · Φ(Y_k) ,    (3.21)
where {Ω} is a set of four real n-dimensional vectors. The choice of the integer n is related, as we shall see, to the choice of the basic nonlinear functions.
Once the two functions Ψ and Φ are fixed, the system (3.21) represents a linear space generated by the components of the two functions in 4n variables {Ω}, whose elements have to be established to minimize the prediction errors.
Using, as before, the least-squares method to solve the system (3.21), we obtain the coefficients {Ω} in terms of the pseudo-inverse matrix [33], with the elements of the matrix S and of the vectors t computed from the data as in the linear case. Based on the knowledge of these elements, we obtain the prediction errors of the nonlinear VAR model, which is to be compared, as previously, with the nonlinear AR model, based on the corresponding restricted system. Even in this nonlinear model, the D^(nl) parameter has the same functional structure as in the linear case, with the only difference that the determining elements have a nonlinear behavior.
It is possible to prove [30] that this choice satisfies the conditions seen previously: if we impose that one of the conditions holds (e.g., the first), then for every µ = 1, ..., n and every ν = 1, ..., n, the function φ_µ(Y) turns out to be uncorrelated with x and with ψ_ν(X). It follows that, for very large N, the difference ε_x − ε_xy tends to zero in accordance with its definition, Ω_12 will be identically null and the nonlinear AR and VAR models coincide. The same procedure holds after swapping ε_x and ε_y.
What still remains unsettled is the choice of the nonlinear functions to be used. The indication is to use classical p-degree polynomial functions (returning, for p = 1, the linear case) or Gaussian functions. From the point of view of computational speed there is not much difference between the two; however, the Gaussian function is preferable, both because it best approximates the probability distribution of the experimental points (as already seen for the TE), and because it is well suited for use as a kernel in clustering algorithms [34,35,36].
In this case, setting n << M, in the space spanned by the vectors X one first identifies n different centers {X̃_ρ}, ρ = 1, ..., n (with any clustering method applied to the data {X_k}, k = 1, ..., M). The same occurs in the space of the vectors Y, finding the n centers {Ỹ_ρ}. Such centers represent the prototypes of the variables X, and the functions ψ represent a measure of the similarity between them and every other point in the space. The same applies to {Ỹ_ρ} and φ.
From a mathematical point of view, for the p-degree polynomials the natural choice is to take as components of Ψ the monomials in the components of X up to degree p, while for the Gaussian functions we have
ψ_ρ(X) = exp( −|X − X̃_ρ|² / (2σ²) ) ,
in which σ is a parameter that must be fixed case by case so as to avoid overfitting the data, but whose order of magnitude, in any case, is that of the average distance between the points of the spaces generated by the vectors X and Y.

Complete Multivariate Model (MVAR)
As we have seen in the last paragraph, the GC has a vagueness in the definition of the indicator D, which does not quantify precisely how much a time series G-causes a second one, but only the net balance with respect to the influence that the second has on the first. To overcome this ambiguity, we can choose [37] not to use the given definition, but to directly compare the variances of the prediction errors in the VAR and AR cases: in this way the systems (3.1) and (3.15) can be corrected by introducing the prediction errors ε_(.) and by generalizing to the whole future of every single series X and Y [38], where m is, as always, the model order (i.e., the number of elements of the past that may affect the future of the destination series) and the vectors x(t) and y(t) represent all the elements of the future (as extended in time as desired, say T) of the sets X and Y.
As seen above, if the error ε_xy is smaller in modulus than ε_x, then the set Y has a causal influence on X ("Y G-causes X"). Obviously, we cannot use the difference between the two errors to define the indicator D^(l) (or D^(nl)), otherwise we would commit the same error as above: we will rather use the ratio of their variances,
GC_{Y→X} = Var(ε_x) / Var(ε_xy) ,
which is, of course, time-independent (in fact, it is calculated on the entire future of the time series). The extension to the nonlinear case trivially occurs with the same procedures seen previously. We will return later to the value of the parameter m, the order of the model.
What makes the calculation of G-causality crucial is not, however, the model used (linear or not, of any order), but the fact that this, as presented so far, does not yet take into account the simultaneous influence that every other time series may have on the series under consideration (in the case of the electroencephalograms we are going to deal with, there are about 60 time series influencing each other). In any case, the mathematical structure of GC is well suited to the multivariate extension (MVAR), and to prove it [37] let us consider as an example the case of three time series X, Y and Z influencing each other (higher-dimensional cases follow immediately from this).
If we insert the three series in the VAR model, we obtain a system in which the sum over the index j, from 1 to the order m of the model, is understood. The errors ε_(.)(t) now cover both the case of the simultaneous presence of the three series and the cases in which, one at a time, one of the three series is not taken into account. In the same way, the corresponding restricted AR model must consider all possible combinations.
A simple way to proceed is to consider separately the covariance matrix of the complete model and the n − 1 covariance matrices of the restricted models, that is, the models in which, one at a time, a time series is omitted. The meaning of the matrix elements we have just seen is quite straightforward: for example, comparing the elements ε_xyz and ε_xz quantifies the influence of Y on X once Z has been taken into account. For the choice of the model order m, two criteria are commonly adopted: the first is the Akaike Information Criterion (AIC); the second, the Bayesian Information Criterion (BIC, [41]), assumes that the function to be minimized penalizes the residual error variance with a term growing as the number of parameters times the logarithm of the number of samples. The BIC is certainly the most used in neuroscience, being more agile when particularly extended data series have to be managed. However, in cases where the BIC (or AIC) either provides an order m too high for the calculation to be computationally efficient, or does not present a well-defined minimum (more than one minimum, or plateaux), the parameter m can be chosen smaller [37], provided the BIC/AIC curves substantially do not decrease further beyond that limit [39].
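A sketch of the BIC order selection (univariate for brevity, and entirely our own toy construction; the minimized function M ln σ̂²(m) + m ln M is one standard form of the criterion described here):

```python
import numpy as np

def ar_residual_var(x, m):
    """Residual variance of a least-squares AR(m) fit."""
    N = len(x)
    Xp = np.column_stack([x[j:N - m + j] for j in range(m)])
    t = x[m:]
    w, *_ = np.linalg.lstsq(Xp, t, rcond=None)
    return np.var(t - Xp @ w)

def bic(x, m):
    """Bayesian Information Criterion for an AR(m) model."""
    M = len(x) - m
    return M * np.log(ar_residual_var(x, m)) + m * np.log(M)

# Simulated AR(3) process: the BIC minimum should fall at (or near) m = 3.
rng = np.random.default_rng(3)
x = np.zeros(6000)
for n in range(3, 6000):
    x[n] = 0.5 * x[n-1] - 0.3 * x[n-2] + 0.2 * x[n-3] + rng.standard_normal()
best_m = min(range(1, 9), key=lambda m: bic(x, m))
```

Beyond the true order, the fit improvement no longer compensates the m ln M penalty, which is the plateau behavior discussed in the text.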

Equivalence between Transfer Entropy and Granger Causality
Let us return to the formulation of the multivariate GC and show that, at the lowest linear order, GC and TE are equivalent as long as the original time series exhibit a Gaussian distribution.
In a formalism slightly different from the one we have used so far, but absolutely equivalent, due to Geweke [37] and already used for the multivariate redefinition of the TE (paragraph 1.6), if we denote by Σ(ε) the error variance of ε_(.), then (3.41) can be rewritten as follows:
GC_{Y→X} = ln [ Σ(ε_x) / Σ(ε_xy) ] .
Let us now assume that the time series on which the TE is calculated are characterized by a Gaussian distribution, so that we can rely on the relationship between the entropy of a time series (let us call it X) and its variance [42]:
H(X) = ½ ln [ (2πe)^n det Σ(X) ] ,
where n is the dimension of X. For the conditional entropy a very similar relationship holds: starting from the well-known property [42] H(X|Y) = H(X, Y) − H(Y), remembering that the joint covariance matrix can be written in block form, and inserting the known matrix identity [43]
det [ Σ(Y) Σ(Y,X) ; Σ(X,Y) Σ(X) ] = det Σ(Y) · det [ Σ(X) − Σ(X,Y) Σ(Y)^{−1} Σ(Y,X) ] ,
we easily obtain
H(X|Y) = (n/2) ln(2πe) + ½ ln det [ Σ(X) − Σ(X,Y) Σ(Y)^{−1} Σ(Y,X) ] ,    (3.46)
i.e. the useful result that the conditional entropy of Gaussian variables is fixed by the conditional (partial) covariance. Extended to the case of three variables (with a recursive procedure) and substituted into (1.33, Chapter 1), this returns
TE_{Y→X} = ½ GC_{Y→X} .
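The Gaussian equivalence can be verified numerically (a toy Gaussian system of our own): the conditional variance is computed once by regression, the GC route, and once via the Schur complement of the second-moment matrix, the Gaussian-entropy route, and the two resulting quantities differ by a factor of exactly two in the logarithm.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 4000, 2
y = rng.standard_normal(N)
x = 0.7 * np.roll(y, 1) + 0.3 * rng.standard_normal(N)

t = x[m:]
Xp = np.column_stack([x[j:N - m + j] for j in range(m)])   # past of x
Yp = np.column_stack([y[j:N - m + j] for j in range(m)])   # past of y

def resid_var(A, t):
    """Mean squared least-squares residual of t regressed on A (GC route)."""
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    r = t - A @ w
    return np.mean(r ** 2)

def schur_var(A, t):
    """Conditional variance of t given A via the Schur complement of the
    second-moment matrix (the Gaussian entropy route)."""
    Saa = A.T @ A / len(t)
    Sat = A.T @ t / len(t)
    Stt = t @ t / len(t)
    return Stt - Sat @ np.linalg.solve(Saa, Sat)

Z = np.hstack([Xp, Yp])
gc = resid_var(Xp, t) / resid_var(Z, t)                     # Granger ratio
te = 0.5 * (np.log(schur_var(Xp, t)) - np.log(schur_var(Z, t)))
```

Both routes compute the same conditional variances, so TE = ½ ln(GC ratio) holds to machine precision on the sample.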

Introduction
The analyses presented so far, and in particular those relating to the connectivity between different brain areas, furnish only a partial view of what happens in the transmission of information between cortical areas: the TE, like the GC, simply provides a "map" of functional activation, which can be studied in greater detail with more sophisticated tools such as network analysis, the so-called Networking [44].
Born in computer science as an evolution of classic graph theory, this discipline deals with the description, both global and local, of the connections between the sub-components of a complex system, such as the brain can be considered (in our case, the cortex only, with its related areas) [44].

Networking Principles
A network is the mathematical representation of a real system, however complex, defined by a set of nodes (or vertices) and the connections between them (links, or edges).
Basically, there are three different types of networks that can be studied: anatomical, functional and effective. The first studies the structure of the physiological connections between brain areas, highlighting the anatomical tracts that physically connect them. The other two, instead, study respectively the correlation and the causal relationship between the electrical activity of the various brain areas: one can say that, while the first is a "physical" network for the brain, the other two represent the network of causality and information flowing through it.
In BN, nodes represent specific areas of the cortex: in particular, the scalp areas covered by the individual electrodes, which can later be reorganized into the more "classical" frontal, central, temporal, parietal and occipital lobes, in their left and right variants [45]. What is particularly relevant is that the mounting pattern of the electrodes must cover the whole scalp as much as possible without overlapping, in order to avoid masking effects on the actual measures or the creation of "ghost" areas that would appear, as will become clear later, particularly segregated [46,47]. It is also important to emphasize that the assembly scheme and the covering of the scalp (parcellation) must be uniquely determined and never changed during the analysis, as the results depend on the chosen scheme [48,49]. In the present study, anatomical connectivity will not be considered since, at the moment, its results cannot be compared with those of the other two types of network: these, in fact, start from the assumption that information transmission takes place at the cortico-cortical level, while anatomical connectivity (deduced, for example, from an MRI tractography) generates a network whose connections are cortico-subcortical as well.
[Figure: an example network; note the motifs (such as the blue line pattern) and the triangles (the red circuit, which cannot be traversed in the reverse direction) present in the network.]
Certainly, the next step in the description of the real brain network will be the interconnection and simultaneous interpretation of the anatomical and informational networks, but this will not be the subject of the present work.
We will therefore focus, as we shall see in more detail, on functional and effective connectivity, favoring whenever possible the latter: the asymmetric (directed) version of the connection matrix will always be preferred to its symmetrized counterpart. In other words, effective connectivity will always be preferred to the functional one.

The threshold problem
Before the connection matrices can be used for analysis, any element whose value is particularly small must be properly removed from them, as such elements represent spurious or scarcely significant connections that can obscure the topology of the strongest and most significant ones [52]. It is therefore necessary to apply a threshold to eliminate those elements.
The choice that can be made is twofold: the threshold can be absolute or relative (or adaptive). In the first case an absolute minimum value is chosen, independently of the model used or the applied stimulation, below which each matrix element is set to zero. In the second case a value between 0 and 1 is chosen which cuts, for each activation matrix, every element whose modulus falls below the product of this value and the largest element of that matrix: in practice, every element whose modulus is below a certain percentage of the maximum value of the entire matrix is set to zero.
This second method is more efficient than the first, since it avoids inconsistent results such as identically null matrices (that is, a "totally disconnected network") and allows the network to be studied not on the basis of the absolute intensity of the connections but according to the relative dynamics between its parts [53].
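The relative (adaptive) threshold described above can be sketched in a few lines of plain Python; the 3×3 matrix and the cut value p = 0.3 are arbitrary illustrations, not values taken from the studies:

```python
def proportional_threshold(W, p):
    """Zero every entry whose modulus is below p times the largest modulus
    in the whole matrix (the relative/adaptive threshold described above)."""
    peak = max(abs(w) for row in W for w in row)
    cut = p * peak
    return [[w if abs(w) >= cut else 0.0 for w in row] for row in W]

# Toy 3x3 connection matrix: with p = 0.3 only entries >= 0.3 survive.
W = [[0.0, 1.0, 0.05],
     [0.2, 0.0, 0.60],
     [0.9, 0.1, 0.0]]
Wt = proportional_threshold(W, 0.3)
```

Because the cut is computed per matrix, a uniformly weak matrix keeps its strongest connections instead of collapsing to the all-zero "totally disconnected network".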

Characteristic features of BN
A single measurement made on the network is able to characterize various aspects of brain connectivity, both at the local and the global level. Throughout this section we present several measures that, in different ways, describe aspects of functional integration and segregation, quantify the importance of the individual cortical areas, characterize the routes and the different possible circuits that can be followed by information, and analyze the potential of a network to restore connections that may have been interrupted for any cause (pathological, physical, etc.).

Basic Measures
Each measurement made on BN is based on the definition of some fundamental quantities. Let a_ij be the connection status between nodes i and j: if a_ij = 0, the two nodes are disconnected; otherwise they are defined close (a_ij = 1 in a binary network, a_ij = w_ij if the network is weighted: from now on, the weights will be considered normalized, so that 0 ≤ w_ij ≤ 1).
For a binary network, the number of links is defined as l = Σ_{i,j} a_ij (4.1), while for weighted networks it has the form l^w = Σ_{i,j} w_ij (4.2). Note that for undirected networks a_ij = a_ji, therefore in (4.1) and (4.2) each link is counted twice, while for directed networks each link is counted once and only once, each of them being directed from node i to node j in a unique way.
The first measure that can be made on the network is the degree of each single node, considered as the number of connections (total, ingoing and outgoing) attached to it: k_i = Σ_j a_ij. For the directed version we can distinguish the degree in two ways, for incoming and outgoing traffic: k_i^in = Σ_j a_ji and k_i^out = Σ_j a_ij. In this context, the degree is a measure of how many nodes are close to the node at hand (its neighbors). If the network is undirected, the number of ingoing connections equals the outgoing ones; in any case, the total degree is the sum of the in- and out-degrees. Moreover, the degree is a measure of the importance that a node (or a cluster of nodes) plays within a network.
The weighted version of the node degree, called strength, sums the intensities of the individual connections of a node instead of simply counting them: s_i = Σ_j w_ij. The set of the degrees of all nodes in the network defines the degree distribution, an important indicator of the ability of the network to regenerate connections that have dropped due to internal or external causes.
The average degree of the entire network is known as density, and is the wiring cost of the network: the higher this indicator, the more efficient the network.
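The basic quantities above (degree, strength, density) reduce to simple sums over the connection matrix; a minimal sketch, on a toy 3-node weighted network whose weights are purely illustrative:

```python
def degree(A, i):
    """Number of links attached to node i (binary count of nonzero entries)."""
    return sum(1 for j, a in enumerate(A[i]) if j != i and a != 0)

def strength(W, i):
    """Weighted degree: sum of the weights of the links of node i."""
    return sum(w for j, w in enumerate(W[i]) if j != i)

def density(A):
    """Fraction of possible links actually present (average connection density)."""
    n = len(A)
    links = sum(degree(A, i) for i in range(n))
    return links / (n * (n - 1))

# Toy weighted, undirected network on 3 nodes (weights are illustrative).
W = [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
```

Here node 1 has degree 2 and strength 0.8 + 0.5 = 1.3, and the density is 4 link ends over 6 possible, i.e. 2/3.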
We can measure the shortest path length between two nodes i and j by first denoting with g_i↔j the path of smallest length between node i and node j (ultimately, the geodesic between the two nodes) and then summing over all the links belonging to it: d_ij = Σ_{a_uv ∈ g_i↔j} a_uv. The weighted version also requires the computation of a map from weights to lengths, f(w_ij) (e.g. the inverse, f(w_ij) = 1/w_ij): d^w_ij = Σ_{w_uv ∈ g^w_i↔j} f(w_uv). The directed version, as for the degree, distinguishes the distance from i to j from its inverse, replacing the generic geodesic g_i↔j with the directed one g_i→j. It is important to emphasize how the characteristics of a network are strongly influenced by these initial measures, as well as by the number of nodes and links. For this reason, whenever we do not have at least two populations to compare, or we want to test the null hypothesis that our results are mere artifacts due to a casual distribution of nodes within the network, it is possible to artificially generate, for each measure, comparison networks (the so-called null model, NM) that share all these quantities with the real network under investigation. On the other hand, the topology of the NM and the spatial distribution of its nodes play no role: different topologies are allowed, ranging from random to fractal [54].

Functional Segregation measures
Functional segregation refers to the tendency of the network to let certain processes take place only within particular areas that, most of the time, are internally densely connected and present, at the same time, a rather reduced number of connections from and to the outside compared to the internal ones.
In the first place, the segregation measures quantify the number of such areas, called modules (or clusters), the presence of which indicates the possibility that the related areas of the network are segregated. Alternatively, it can be stated that, in functional and effective networks, the abnormal presence of clusters is an indicator of a possibly strongly structured hierarchy in the information path.
The simplest measures of segregation are based on the counting of triangles (or polygons: closed circuits in the patterns followed by the information, consisting of n nodes and the same number of links) in the network, whose relative abundance is an important factor in the study of segregation.
We will define the number of triangles around node i in the binary network by means of the relation t_i = ½ Σ_{j,h} a_ij a_ih a_jh, while for weighted and directed networks the following definitions apply: t^w_i = ½ Σ_{j,h} (w_ij w_ih w_jh)^(1/3) and t^→_i = ½ Σ_{j,h} (a_ij + a_ji)(a_ih + a_hi)(a_jh + a_hj) (4.10). For each single node, the fraction of triangles surrounding the vertex is defined as the clustering coefficient C_i = 2 t_i / [k_i (k_i − 1)], so that the clustering coefficient of the entire network can be calculated as C = (1/n) Σ_i C_i, with the usual weighted and directed variants. By definition, the clustering coefficient is defined only for k_i ≥ 2, otherwise it is identically zero. It can be proved [55] that this quantity is equivalent to the fraction of triplets of nodes that are close to each other.
The average of this coefficient throughout the network reflects the presence of a strongly centralized connectivity around one or more nodes (or areas).
Following the definition of the clustering coefficient, we note that this indicator is, indeed, strongly influenced by the presence of low-degree nodes, being normalized node by node. A variation of this coefficient, the transitivity, is normalized with respect to all the nodes of the network, T = Σ_i 2 t_i / Σ_i k_i (k_i − 1), and so is not affected by such a dependency [56]. Note that the transitivity is not defined for the single node, but for the entire network.
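The difference between the node-normalized clustering coefficient and the network-wide transitivity is easy to see on a toy graph; a minimal sketch (the 4-node example, a triangle with a pendant node, is an arbitrary illustration):

```python
def clustering(A, i):
    """Fraction of pairs of neighbors of i that are themselves connected."""
    nbrs = [j for j, a in enumerate(A[i]) if a and j != i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # the coefficient is defined only for degree >= 2
    triangles = sum(A[u][v] for u in nbrs for v in nbrs if u < v)
    return 2.0 * triangles / (k * (k - 1))

def transitivity(A):
    """Network-wide ratio of closed triplets to all triplets (not node-normalized)."""
    n = len(A)
    closed = sum(A[i][j] * A[i][h] * A[j][h]
                 for i in range(n) for j in range(n) for h in range(n))
    triplets = sum(k * (k - 1) for k in
                   (sum(A[i][j] for j in range(n)) for i in range(n)))
    return closed / triplets if triplets else 0.0

# A triangle (nodes 0, 1, 2) plus a pendant node 3 attached to node 0 only.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
```

Node 1 has clustering 1 (its two neighbors are connected), node 0 only 1/3 (one closed pair out of three), while the global transitivity of the graph is 3/5.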
More sophisticated measures of segregation not only describe the presence of densely interconnected cortical areas, but can even reconstruct the exact size and composition of these groups. The latter, the composition, is determined by the modular structure subdividing the network into groups of nodes having the highest possible number of internal connections and the minimum possible number of external connections [57].
The degree to which the network can be divided into such subgroups (which should, in any case, be non-overlapping) is measured by the modularity [58] which, unlike all the other measures, is not calculated exactly but via optimization algorithms [59], generally sacrificing a few degrees of accuracy in favor of calculation speed.
If we denote by M the set of all the non-intersecting modules the network can be divided into, and by e_uv the total fraction of links connecting the modules u and v, then the modularity Q is defined by the relation Q = Σ_{u∈M} [e_uu − (Σ_{v∈M} e_uv)²]. In a completely equivalent way we can reformulate the expression of Q according to Newman [60], getting Q = (1/l) Σ_{i,j} [a_ij − k_i k_j / l] δ_{m_i,m_j}, where m_i and m_j are the modules of M containing nodes i and j, l is the total number of link ends and δ_{m_i,m_j} is the usual Kronecker delta. From this expression it is possible to derive the form of the modularity coefficient for weighted and directed networks. However, as we will see in the section concerning centrality, for our purposes it is not imperative that modules always remain separated from each other; indeed, it will be important to identify those nodes (or clusters of nodes) playing a junction role between two or more modules: it will therefore be necessary to consider the hypothesis that one or more nodes belong simultaneously to two or more modules. In this case the algorithm developed by Palla [62] seemed to us the most appropriate for such a description, keeping in mind that the regions of the cortex we consider in our studies are not exactly juxtaposed zones, but show a minimal overlap (as is evident considering the names of the electrodes that define their position on the scalp).
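Newman's formula for Q, given a module label for every node, can be sketched directly (the 4-node example with two obvious modules is an arbitrary illustration):

```python
def modularity(A, membership):
    """Newman modularity Q for a binary undirected network, given a module
    label for every node; l counts every link twice (total link ends)."""
    n = len(A)
    k = [sum(A[i]) for i in range(n)]
    l = sum(k)  # each undirected link counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if membership[i] == membership[j]:
                q += A[i][j] - k[i] * k[j] / l
    return q / l

# Two clear modules: the pair 0-1 and the pair 2-3, with no links between them.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
```

With the natural partition [0, 0, 1, 1] this network reaches Q = 0.5, while lumping all nodes into one module gives Q = 0, as expected.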
Finally, we define the concept of Local Efficiency of a node (or a cluster of nodes) as the tendency of that node to communicate with its neighbors using the shortest possible path.
So let d_jh(N_i) be the length of the shortest path between j and h passing only through the neighbors of node i. The local efficiency of node i is then defined as E_loc,i = [k_i (k_i − 1)]⁻¹ Σ_{j,h∈N_i, j≠h} 1/d_jh(N_i), with the usual weighted and directed variants.

Functional Integration measures
The most commonly used measure of functional integration is the characteristic path length, i.e. the average shortest path length over all pairs of nodes: L = [n(n − 1)]⁻¹ Σ_{i≠j} d_ij. It is important to note that, while for the binarized version of the connection matrix the characteristic length is evaluated by adding the active links from node i to node j, in the weighted version the link lengths are added, the latter quantity being inversely proportional to the weight of each link: larger weights are indicative of stronger ties and, as a result, of smaller characteristic paths and a higher "closeness" between two nodes.
The inverse of the average path length is defined as the network global efficiency [63], which for the single node can be calculated as E_i = (n − 1)⁻¹ Σ_{j≠i} 1/d_ij, so that the global efficiency of the whole network is E = (1/n) Σ_i E_i = [n(n − 1)]⁻¹ Σ_{i≠j} 1/d_ij, and similarly for the weighted and directed versions. Unlike the path length, the global efficiency can also be calculated on partially disconnected networks: the path length, in fact, is defined as "infinite" for two disconnected nodes, which leads to an infinite characteristic length, while the corresponding contribution to the efficiency is simply null. In general, the path length turns out to be strongly influenced by longer routes, while the global efficiency is mostly influenced by short ones. According to some authors [64] this makes the global efficiency a much more effective and significant measure of functional integration.
In general, it is possible to demonstrate [53] that effective networks are statistically more globally efficient than functional networks, which show a lower integration between different modules.
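Global efficiency as the average inverse shortest path length can be sketched with a breadth-first search over a binary network (the 3-node chain is an arbitrary illustration):

```python
from collections import deque

def bfs_distances(A, source):
    """Shortest path lengths (in links) from source in a binary network."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, a in enumerate(A[u]):
            if a and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def global_efficiency(A):
    """Average inverse shortest path length; disconnected pairs contribute 0."""
    n = len(A)
    total = 0.0
    for s in range(n):
        dist = bfs_distances(A, s)
        total += sum(1.0 / d for t, d in dist.items() if t != s)
    return total / (n * (n - 1))

# A 3-node chain 0-1-2: the pair (0, 2) is two links apart.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
```

For the chain, the pairwise distances are 1, 1 and 2 in each direction, giving E = (1 + 1 + 1/2) · 2 / 6 = 5/6; a pair left unreachable would simply be skipped instead of producing an infinite length.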

Small-world Connectivity
The different instances of segregation (strong hierarchical structure in information processing, with few external connections) and integration (strong interconnections between modules) are studied jointly by the so-called small-world brain connectivity, which measures how close the behavior of a network is to that of an ideal one, i.e. a network with the same number of nodes and links as the one under examination, but whose spatial distribution ensures that the work done by highly specialized areas is efficiently redistributed toward other highly specialized areas as well.
This linking ability appears to be carried out by the anatomical connectivity, which is not, however, the subject of our research. Nevertheless, a number of studies [65] have shown that by combining the characteristics of functional connectivity with those of the effective one, insights about small-world connectivity can be obtained, as both are able to highlight modular structures within the network while, by comparison, effective connectivity also highlights the structures connecting separated areas: networks simultaneously showing high segregation and integration will be characterized by a high small-world connectivity, while those with high integration but low segregation will have characteristics very far from those of the ideal network.
Alternatively [66], using the definition of small-world connectivity, we can define the small-worldness of a network by comparison with an artificially generated one with random topology: a network shows small-world connectivity if it is much more clustered than another with the same characteristic path length.
Let then two networks be given, one of which artificially generated by a computer with a random topology, and let C and C_rand be their respective clustering coefficients, and L and L_rand their characteristic path lengths. We define the small-worldness of the first network with respect to the second as S = (C / C_rand) / (L / L_rand). If this ratio is much greater than unity, then the first network shows the peculiar characteristics of small-world connectivity. The extension to weighted and directed networks is trivial.

Motifs of the network
The global measures (i.e., extended to the whole network) of the quantities defined so far can be complemented by the analysis of motifs, recurrent patterns of connections whose presence becomes more important as the frequency with which they occur grows, usually quantified by the motif z-score.
The latter quantity is defined by considering not only the network under investigation, but also a number of artificially generated networks with random topology, each with its own occurrence rate of the pattern h. If such a set of networks presents a mean ⟨J_h,rand⟩ and a standard deviation σ_{J_h,rand} for the distribution of occurrences of h, the z-score is z_h = (J_h − ⟨J_h,rand⟩) / σ_{J_h,rand}. Considering, instead, the motif distribution around individual nodes, one can obtain the fingerprint of a node in the network, characterizing the functional role of the node (or region of the cortex) it belongs to [67], using the already considered motif occurrence percentage: the collection of the J_h,i, in which J_h,i is the occurrence percentage of pattern h around node i, constitutes the fingerprint of node i. For weighted networks it is necessary to introduce the concept of intensity of the motif h: I_h = Σ_u (Π_{(i,j)∈L_u^h} w_ij)^(1/|L_u^h|), where the sum is extended to all the occurrences u of motif h and L_u^h is the set of links in the u-th occurrence. On the basis of these quantities we define the motif z-score intensity (4.34) and the motif fingerprint intensity in complete analogy with the binary case. It should be noted that the motifs of functional and effective networks only partially coincide with those of the anatomical network, as the former only partially use the paths described by the latter; for this reason it would be more correct to separate the concepts of functional and anatomical motifs, defining the former as the set of all possible closed paths that can be embedded in the anatomical ones.

Centrality measures
Some areas of a network play an important role as hubs of information and connection between highly specialized (i.e., segregated) areas, and their role, as we shall see in the next section, becomes ever more important when, due to external or internal causes, the different areas can no longer communicate with each other. This is the paramount concept of centrality, which can be applied both to individual nodes and to whole areas of the network (in our case, the different cortical areas).
The first measure of centrality we can consider is the degree, already seen in the introduction to this chapter, having an immediate interpretation: areas with a particularly high number of connections (both incoming and outgoing) are more likely to interact and sort the information.
Degree measures can be specialized by considering the intra-modular and inter-modular variants, describing connectivity and centrality within the specialized modules (ie, the previously considered segregated areas) and between them [68].
The intramodular z-score is calculated, for each node, by identifying the module node i belongs to (let us call it m_i) and counting the number of links between i and all the other nodes in m_i: if we denote this number by k_i(m_i), and by k̄(m_i) and σ_{k(m_i)} the mean and standard deviation of this quantity over the nodes of the module, we obtain z_i = [k_i(m_i) − k̄(m_i)] / σ_{k(m_i)}. The participation coefficient is defined by means of the same quantities: P_i = 1 − Σ_{m∈M} [k_i(m)/k_i]². Nodes (or clusters of nodes) having a high intramodular degree and a low participation coefficient (known as local hubs) have a high probability of playing a key role in facilitating the segregation of an area, while a high participation coefficient corresponds to a high probability for that node (or cluster of nodes) to facilitate intermodular integration (connection hubs).
Most centrality measures start from the assumption that the central nodes (or areas) take part in many paths within the network, especially the shortest ones, thus acting as control stations for the information flows. For example, the closeness centrality is defined as the inverse of the average shortest path length between a node and all the others in the network: for the i-th node, C_i^clo = (n − 1) / Σ_{j≠i} d_ij. Similarly, the betweenness centrality is defined as the fraction (over the total existing in the entire network) of shortest paths passing through a given node: b_i = [(n − 1)(n − 2)]⁻¹ Σ_{h≠j, h≠i, j≠i} ρ_hj(i)/ρ_hj, where ρ_hj is the number of shortest paths from h to j and ρ_hj(i) is the number of those passing through i. The weighted and directed versions are obtained simply by replacing the binary distances with their weighted and directed counterparts. In this sense, nodes with a very high betweenness centrality behave as bridges connecting two or more nodes otherwise separated. Naturally, the concept can be extended to links, not only to nodes, obtaining two variants of the betweenness centrality: the Vertex Betweenness (VBC) and the Edge Betweenness (EBC), both computable with fast algorithms such as those of Brandes [69] and Kintali [70].
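Closeness centrality, in particular, follows directly from the shortest path lengths; a minimal sketch on the same illustrative 3-node chain used above:

```python
from collections import deque

def closeness(A, i):
    """Inverse of the average shortest path length from node i to all others."""
    n = len(A)
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v, a in enumerate(A[u]):
            if a and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(d for node, d in dist.items() if node != i)
    return (n - 1) / total if total else 0.0

# On a 3-node chain 0-1-2 the middle node is the most central.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
```

The middle node reaches every other node in one step (closeness 1), while the end nodes need an average of 1.5 steps (closeness 2/3).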
It should be noted how anatomical nodes showing a strong centrality are also those who, connecting regions otherwise disconnected between them, facilitate their integration, increasing their functional connection. At the same time, however, these anatomical links make the functional centrality of these nodes less important, which are consequently less easy to spot.

Resilience Measures
Resilience is defined as the potential ability of a network to recover its functionality after interruptions (pathological, structural or otherwise) of the connections between its constituting modules: in fact, the deterioration of the anatomical connections between brain areas inevitably leads to a similar deterioration of functional and effective connectivity, both between the directly injured areas and between areas using them as an information hub. Network analysis allows us to highlight those areas that, in case of such damage, are able to restore the lost connectivity.
The resilience measures can be subdivided into two classes: direct and indirect. By their nature, direct measurements [71] require long periods of observation and experimentation, as they test the actual adaptation of the network to the progressive deterioration of anatomical and, by consequence, functional connectivity. In this case, which is not of our interest, the chosen measures must be applicable to disconnected networks, since one or more areas can evolve this way over time.
The resilience measures of our interest, however, are the indirect ones, which test the potential recovery capacity of networks and areas presenting a high risk of deterioration.
One such measure is the degree distribution of the network [72].
If p(k_i) is the probability that node i has degree k_i, then the (cumulative) degree distribution can be defined as P(k) = Σ_{k′≥k} p(k′) (4.49), and the weighted and directed versions can be trivially obtained. Networks whose degree distribution follows a power law may be more resilient to random deterioration, but at the same time are more vulnerable in case of damage to high-degree central nodes.
As real networks do not always follow a well-defined power law, it is possible that some modules inside them present a well-defined degree distribution (most of the time, with a low exponent). The analysis of these areas can in some cases prove useful for the study of the resilience of the entire network [64].
A second measure that can effectively characterize the resilience of a network is the assortativity coefficient, the correlation between the degrees of the nodes at the beginning and at the end of an information path, i.e. the Pearson correlation coefficient between the degrees found at the two ends of each link. A positive assortativity presupposes that the two ends of the paths consist of nodes (or clusters, particularly hubs) interconnected between them and of comparable resilience; vice versa, a negative coefficient expresses the possibility that the central nodes are unevenly distributed in the network and that, consequently, their eventual removal makes the entire network vulnerable, generating an unstable behavior.
Two other measures that may characterize the resilience of a network are the local assortativity coefficient (i.e., a localized version of the assortativity coefficient, restricted to modules or areas) and the average neighbor degree [73]: k_nn,i = (1/k_i) Σ_{j∈N_i} k_j. Low values of these indicators show a high risk of impairment for the entire network if the referenced nodes were removed.
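Both quantities can be sketched in plain Python; the star network used here is an arbitrary illustration of the strongly disassortative case (a hub linked only to leaves):

```python
def degree(A, i):
    return sum(A[i])

def avg_neighbor_degree(A, i):
    """Mean degree of the neighbors of node i."""
    nbrs = [j for j, a in enumerate(A[i]) if a]
    if not nbrs:
        return 0.0
    return sum(degree(A, j) for j in nbrs) / len(nbrs)

def assortativity(A):
    """Pearson correlation between the degrees at the two ends of each link."""
    n = len(A)
    xs, ys = [], []
    for i in range(n):
        for j in range(n):
            if A[i][j]:
                xs.append(degree(A, i))
                ys.append(degree(A, j))
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
    sx = (sum((x - mx) ** 2 for x in xs) / m) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / m) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# A star network: the hub (node 0) connects only to degree-1 leaves.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
```

For the star, every link joins the degree-3 hub to a degree-1 leaf, so the assortativity is exactly −1: removing the hub disconnects the whole network, the unstable situation described above.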
Chapter 5
EEG Data analysis

Introduction
The tools we have presented up to this point can be applied to various fields of research and inference analysis, ranging from econometric data (for which most of them were originally developed) to weather data, and so on.
The analysis of the electroencephalographic data is intended to highlight, through the techniques set out above, any differences which may exist between two or more populations of migraine patients, in terms of the total quantity of information flowing between different cortical areas, or of the efficiency with which they distribute and process it, and so on.
For this purpose, it becomes necessary to consider at least two populations in each study, one of which must serve as a reference for the analysis (in the present case, the so-called "controls", i.e. subjects not suffering from the disease under study). Moreover, the choice of the particular stimulation, of course decided by the neurologist, has been made case by case either so as to stimulate areas particularly suited to receiving it (for example, visual stimulation, which affects the parieto-occipital areas, is used to highlight possible cognitive deficiencies of a population or to verify that, in migraine patients, the related neuroelectric activity is more intense than in the others, as is well known in neurology), or so as to affect as many areas as possible in order to assess the integration state of the network (this is the case of painful laser stimulation which, by acting directly on the brainstem system and on pain receptors, reaches the cortex directly without first passing through one of its areas).

10-20 system
The data considered in this work, as widely mentioned in the previous chapters, are of biological origin: precisely, they are recordings of electrical potentials measured at specific sites of the human scalp, arranged so as to cover as much as possible its entire extension (the so-called "channels"), each of which records a signal whose magnitude is a few tens of µV and whose value is the sum of all the potentials of the underlying cortical area covered by the electrode. Our first work, however, was conducted using only a 12-channel recording system, of which only 6 channels were sufficiently free from artifacts to be taken into account, as a result of an older registration system with respect to the one adopted later. However, both recording systems share the arrangement of the common electrodes, for which the 61-channel system can be considered for all intents and purposes an extension of the 20-channel one, with a higher spatial resolution.
An important consideration must be made. As is well known, not all

Frequency Bands
Like all time-varying signals, the EEG too can be considered as a sum of sinusoidal contributions at different frequencies. However, such a spectral characterization would not be easy to treat, since the number of sinusoidal components of the signal is theoretically infinite.
It was therefore decided to use not the individual frequencies but whole frequency bands. Specifically, there are four to six bands, depending on the conventions adopted for two optional sub-bands. This convention provides the following distinction:
• δ band, from 0.5 to 4 Hz: the band of long waves and unconscious activity, generally associated with deep sleep;
• θ band, from 4 to 7 Hz: associated with light sleep or drowsiness;
• α band, from 7 to 12 Hz: the first band of consciousness, associated with the resting state and with relaxation with eyes closed while awake;
• β band, from 12 to 30 Hz: the band related to the neural activity of wakefulness and attention.
This is the four-band system that will be used from now on, but it is not the only possible choice: some authors prefer to split the α band into two equally wide bands, α1 and α2, while others prefer to include the γ band, with frequencies from 30 Hz upward but found almost only in the frontal areas. Obviously the choice is one of convenience and reflects the specific needs of a given research work. Moreover, the separation between bands is not always so sharp: each band is often extended by 0.5 Hz on either side of its interval, so that the bands are not separated but overlapping.
The localization of brain waves is not a prerogative of the γ band only: β waves, too, are present almost exclusively in the frontal and central areas. In addition, the alteration of some frequency components is closely related to specific disorders: the alteration of δ rhythms can be an indicator of a medium-to-severe disorder of brain activity, while the alteration of θ rhythms is often a sign of a mild-to-moderate disorder of cortical activity.
However, this topic is beyond the scope of this work, and for a more in-depth case studies, please refer to specialized medical literature.
The choice of the bands to be used (all four bands, or only α and β, and so on) is not always related only to the type of stimulation performed on the patient: a light flashing at a given frequency will not necessarily affect only the frequency band in which it falls, but also its neighbors, in something like a resonance effect; likewise, a painful laser stimulation does not affect only the lowest and highest bands, i.e. the processing bands of the stimulus during wakefulness (β band) and of pain (δ band). It will be seen, in fact, that the neural response involves almost all the frequency bands.
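The band decomposition above amounts to summing spectral power over fixed frequency intervals. A minimal sketch (the synthetic 10 Hz test signal and the sampling rate are illustrative assumptions, not recording parameters from the studies):

```python
import numpy as np

# The conventional four-band split (band edges in Hz, as listed above).
BANDS = {"delta": (0.5, 4), "theta": (4, 7), "alpha": (7, 12), "beta": (12, 30)}

def band_power(signal, fs):
    """Crude per-band spectral power from the FFT of a 1-D signal."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return {name: float(power[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in BANDS.items()}

# A synthetic 10 Hz oscillation (an "alpha-like" rhythm): almost all of its
# power should fall in the alpha band.
fs = 256
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 10 * t)
powers = band_power(x, fs)
```

In practice a windowed estimator (e.g. Welch's method) would be preferred over the raw FFT, but the band-summing step is the same.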

Data pre-processing
Before analyzing the available EEGs, it is necessary to clean the data from all kinds of artifacts that could lead to significant errors, both in the measurement of the real quantities and in the comparison of results among populations (patients suffering from a specific pathology, healthy subjects, etc.).
Following the procedures universally adopted by the scientific community, the technician responsible for the recording session and the neurologist first applied a digital band-pass filter in the frequency range between 0.1 and 70 Hz, and then a Notch filter to remove the 50 Hz interference (and its harmonics) from the power line that may have affected the recording.
Finally, an electrode has been chosen to be the reference for the measurement of potential differences, namely Fpz. The electrodes used presented a total impedance of less than 10 kΩ.
The band filtering was performed using a second-order Butterworth filter applied in both directions (forward and backward), so as to obtain zero phase distortion.

Potentials evoked by visual stimulation (SVEP): flashing light
In the studies on functional and effective connectivity in migraine, in all its forms, the first two types of stimulation to which patients were subjected came from visual flashing light, belonging to the category of SVEPs ("Steady-state Visually Evoked Potentials"). The first type of stimulation considered is the simple flash at different temporal frequencies.
The available EEGs came from 19 patients suffering from migraine with aura (MA), selected according to the ICHD-II criteria [75], 19 subjects experiencing migraine without aura (MO), diagnosed according to the same criteria, and 11 healthy subjects (N), selected among the nursing staff of the Neurology Department.
Statistical inference analyses (ANOVA and χ² tests) showed no imbalance between males and females in the populations, and no peaks around particular values of the age distribution. It was also verified that the MA and MO patients were sufficiently distant in time both from the last migraine attack and from the next, so as to exclude any effect of the pre-attack dysfunctional state. The latter condition was verified by a telephone interview after the recordings had been performed.
The visual stimulation was delivered by an intermittent strobe with a radiant energy of 0.29 J, placed 20 cm from the patient, who was asked not to close the eyes during stimulation.
Five stimulation frequencies were used: 9, 18, 21, 24 and 27 Hz. Their choice was not arbitrary, but based on previous physical and neurological studies [76, 77]. Each frequency was delivered for about 40 seconds, with a 20-second interval in between, to ensure that residual effects did not interfere with the next stimulation. For each stimulation about 20 trials were performed.
The potentials were recorded through six electrodes: two occipital (O1 and O2), two parietal (P3 and P4), one central (Cz) and one frontal (Fz); the montage was referenced to the nasion.
The frequency bands used in this first work were only the highest two, α and β.

Analysis of the spectral power
Using the Fast Fourier Transform for discretized signals (FFT), and selecting each time the signal power at the stimulation frequency, it was found that at these frequencies the EEG signal of MA and MO was significantly more powerful than that of N in all channels except those in the fronto-central region (Fz and Cz, figure 5.3).
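The extraction of spectral power at the stimulation frequency can be sketched as follows; the signal here is synthetic and the function name is illustrative.

```python
# Minimal sketch: FFT power of one EEG channel at the stimulation frequency.
import numpy as np

FS = 256.0  # sampling rate (Hz)

def power_at(signal, f_stim, fs=FS):
    """Power of the FFT component closest to the stimulation frequency."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    return spectrum[np.argmin(np.abs(freqs - f_stim))]

# Synthetic channel: an 18 Hz steady-state response buried in noise
t = np.arange(0, 4, 1 / FS)
x = np.sin(2 * np.pi * 18 * t) + 0.1 * np.random.default_rng(0).standard_normal(t.size)

p_stim = power_at(x, 18)   # power at the stimulation frequency
p_near = power_at(x, 15)   # power at a nearby, unstimulated frequency
```

Comparing `p_stim` across groups, channel by channel, is what produces maps like the one in figure 5.3.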

Analysis of functional connectivity: Synchronization Entropy
Once the two weights (m, n) = (1, 1) were assigned, it was possible to proceed with the phase synchronization analysis. In order to reduce the number of multiple comparisons, and thus lower the Bonferroni correction factor (see later), we compared, for each stimulation frequency, not the absolute synchronization factor but the one relative to the baseline, defining for each band the factor

Γ = ρ_flash − ρ_spont

In β band we observe a desynchronization of MA with increasing stimulation frequency (figure 5.4, bottom panel), an effect that is also present in N, although with less intensity.
There is also a clear difference between the two groups of migraine patients: MA desynchronize faster with increasing stimulation frequency than MO.
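A minimal sketch of the 1:1 phase-synchronization index ρ assumed above: instantaneous phases from the Hilbert analytic signal, then the mean phase coherence. The signals here are synthetic.

```python
# Phase-locking value for an m:n phase relation between two channels:
# rho = |< exp(i(m*phi_x - n*phi_y)) >|, 1 = full lock, 0 = no lock.
import numpy as np
from scipy.signal import hilbert

def sync_index(x, y, m=1, n=1):
    """Mean phase coherence of two signals for an m:n phase relation."""
    phi_x = np.angle(hilbert(x))
    phi_y = np.angle(hilbert(y))
    return np.abs(np.mean(np.exp(1j * (m * phi_x - n * phi_y))))

fs = 256.0
t = np.arange(0, 4, 1 / fs)
a = np.sin(2 * np.pi * 10 * t)
b = np.sin(2 * np.pi * 10 * t + 0.7)    # same frequency, fixed phase shift
c = np.random.default_rng(1).standard_normal(t.size)  # unrelated noise

rho_locked = sync_index(a, b)   # close to 1
rho_noise = sync_index(a, c)    # much smaller
```

The quantity Γ above is then simply the difference between this index computed during the flash and during spontaneous activity.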

Analysis of effective connectivity: Granger Causality

Conclusions
As a result of light stimulation, MA and MO show different neuroelectric behavior. In the former population an increase in effective connectivity in β band was observed, while in the latter a similar increase appears in α band. Such a difference may indicate a different neurosynaptic behavior in the two migraine phenotypes [83].

Evoked potentials from photic checkerboard
The second type of light stimulation considered was an intermittent bright checkerboard with two different spatial dimensions, one with a 5 mm edge and the other with a 20 mm edge; the stroboscopic frequency was always kept at 5 Hz.

This kind of stimulation was suggested by the neurologist in accordance with the Shibata hypothesis [80], according to which a photic checkerboard stimulation could elicit important differences in the cortical activity of migraine patients.
The EEGs used in the connectivity analysis consisted of 61-channel recordings of neuroelectric activity during visual stimulation, each lasting about 5 seconds. In addition, recordings of the unstimulated neuroelectric activity, the so-called baselines, were also available.
The sampling rate was 256 Hz, and the recordings were made by placing the stroboscopic checkerboard, with a radiant energy of 0.29 J, 20 cm from the patients, who were asked to keep their eyes wide open during stimulation without blinking.
Based on these considerations, the statistical analysis was conducted, for each channel and for each band, using the one-tailed Student's t test (alternating left and right tails) with a Bonferroni correction for multiple comparisons equal to

b = n_couples × (n_stims − 1) = 3 × 2 = 6    (5.2)

The Bonferroni correction is a post-hoc adjustment, used to preserve the confidence level when more than one comparison is performed before a difference can be considered statistically significant: the universally accepted significance level of 0.05 is lowered to 0.05/b (in our case about 0.008, or 0.004 per tail).
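The per-channel comparison with Bonferroni correction can be sketched as follows; the group data are synthetic, and the use of a Welch (unequal-variance) t test is an assumption.

```python
# One-tailed t test per channel with Bonferroni-corrected significance level.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_channels = 6
b = 6                                  # correction factor from Eq. (5.2)
alpha = 0.05 / b                       # Bonferroni-corrected level (~0.008)

group_a = rng.normal(0.0, 1.0, (19, n_channels))   # e.g. MA indicator values
group_b = rng.normal(0.0, 1.0, (19, n_channels))   # e.g. MO indicator values
group_a[:, 0] += 2.0                   # make channel 0 truly different

# Channels where group A significantly exceeds group B (one tail: A > B)
significant = [
    ch for ch in range(n_channels)
    if ttest_ind(group_a[:, ch], group_b[:, ch],
                 equal_var=False, alternative="greater").pvalue < alpha
]
```

Only channels passing the corrected threshold would then be colored on the topoplots described below.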
In this type of analysis, given the high number of channels, we decided to study the behavior of the populations channel by channel, and to use the topoplot system to visualize the results. Such a system is easy to understand: the human head is pictured from above, and each channel is drawn at its position on the scalp.
The convention we used is the following: if the comparison on a given channel shows no distinction, the channel remains green. If, instead, the channel does distinguish the populations, there are two cases: in the functional/effective analysis, the depth of the channel's color grows with the number of external channels it communicates with; in the brain-network (BN) analysis, the depth of the channel's color grows with the percentage difference between the populations.
The color convention is the following: warm colors if the distinction is in favor of the first element of the compared pair (e.g., in the MA/MO case, red refers to MA), cool colors otherwise (in the same example, blue marks a distinction in which the MO levels are larger).
In this way a few diagrams suffice to furnish both as much information as possible and a clear location of the distinguishing areas.
5.6.1 Analysis of effective connectivity: Transfer Entropy
By means of the Kolmogorov-Smirnov test we verified that the data distributions for each channel were parametric and, with the subsequent Spearman post-hoc test, whether the Gaussian trend of their distribution held. Only at this point were we able to carry out the statistical analysis of the data.
The first parameter that must be fixed in the study of effective connectivity is the lag, which strongly affects the value of cortical connectivity.
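To make the role of the lag concrete, here is a toy binned (plug-in) transfer-entropy estimator; it is a didactic sketch under coarse-binning assumptions, not the estimator actually used in the study.

```python
# Toy transfer entropy: TE(Y->X) = H(X_t | X_{t-lag}) - H(X_t | X_{t-lag}, Y_{t-lag}),
# estimated from binned histograms (plug-in estimator, in bits).
import numpy as np

def transfer_entropy(x, y, lag=1, bins=8):
    """Binned plug-in estimate of the transfer entropy from y to x at a given lag."""
    xt, xp, yp = x[lag:], x[:-lag], y[:-lag]
    dx = np.digitize(xt, np.histogram_bin_edges(x, bins))
    dxp = np.digitize(xp, np.histogram_bin_edges(x, bins))
    dyp = np.digitize(yp, np.histogram_bin_edges(y, bins))

    def entropy(*cols):
        _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # Conditional entropies written as differences of joint entropies
    h_x_given_past = entropy(dx, dxp) - entropy(dxp)
    h_x_given_both = entropy(dx, dxp, dyp) - entropy(dxp, dyp)
    return h_x_given_past - h_x_given_both

rng = np.random.default_rng(2)
y = rng.standard_normal(5000)
x = np.roll(y, 1) + 0.1 * rng.standard_normal(5000)   # x driven by y with lag 1

te_fwd = transfer_entropy(x, y, lag=1)   # strong: y's past predicts x
te_rev = transfer_entropy(y, x, lag=1)   # weak: x's past adds nothing about y
```

If the lag is chosen far from the true coupling delay, `te_fwd` collapses toward the bias level of `te_rev`, which is why the lag must be fixed with care.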

Analysis of Brain Network
The analysis of the BN began with the study of the overall behavior of the network as a function of the threshold applied to eliminate spurious connections. As seen in Section 4.3, it was decided to apply an adaptive threshold to compensate for the different connection intensities characterizing each patient's network. Regarding the statistical treatment of the data, a double key of investigation was used: on the one hand, we continued to use Student's t test on each channel, so as to improve, as already said, the spatial resolution of the analysis; on the other, a parallel statistical comparison was performed using the channels of a cortical area (for example, the right frontal, the left parietal, and so on) as a single ensemble of data, so that the comparison concerned not the individual channels but the entire cortical area.
We thus have to manage three different cases:
• the entire area and most of the individual channels differ significantly across populations: the simplest case, since the behavior of the entire area is determined by that of the individual channels;
• the entire area distinguishes the two populations, but none or only a few channels do the same: in this case the variances of the individual channels in the two populations are too large for the means to be distinguished, but the global variance of the whole area (the square root of the sum of the squared individual channel variances) may be lower than that of the individual channels, allowing the entire area to contribute to the distinction;
• the area does not distinguish the populations, but many of its channels do: each channel has, in the two populations, a small variance compared to the mean value of the indicator, but is globally "drowned" in the great variability of the area's variance, which therefore does not contribute.
Furthermore, each feature calculated from the real data, before being used as a possible discriminator between populations, was compared with the same quantity derived from artificially generated connection matrices (the null models, NM, already seen in Chapter 4) with both random and fractal topologies: real and simulated data shared degree, number of nodes, node distribution and average path length, thus precluding the possibility that the results were due to a random distribution of nodes and links.
Of some relevance, in the lower bands, is the behavior of the FCz-CPz channels, which show a large inter-modular Z-score (not matched by a high participation level) during stimulation with the 20 mm checkerboard: it is likely that, at low frequencies, these channels simply behave as hubs for the frontal ↔ central ↔ parietal sub-cluster. Finally, the comparison of the Z-score/participation pair of measures also shows no substantial differences between the two groups, except for a series of channels that do not show a defined pattern of distinction.
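The null-model comparison can be sketched with a degree-preserving double-edge-swap randomization; this is a pure-numpy toy (the actual NM of Chapter 4 also matched fractal topologies and path lengths).

```python
# Compare a network metric (mean clustering) on the real graph against
# degree-preserving randomized surrogates built by double edge swaps.
import numpy as np

def mean_clustering(adj):
    """Average clustering coefficient of a binary undirected adjacency matrix."""
    deg = adj.sum(axis=1)
    triangles = np.diag(adj @ adj @ adj) / 2.0      # triangles through each node
    possible = deg * (deg - 1) / 2.0
    c = np.where(possible > 0, triangles / np.maximum(possible, 1), 0.0)
    return c.mean()

def rewire(adj, n_swaps, rng):
    """Degree-preserving randomization via double edge swaps."""
    adj = adj.copy()
    edges = [tuple(e) for e in np.argwhere(np.triu(adj, 1))]
    for _ in range(n_swaps):
        i, j = rng.choice(len(edges), size=2, replace=False)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4 or adj[a, d] or adj[c, b]:
            continue                     # swap would create a self-loop or duplicate
        adj[a, b] = adj[b, a] = adj[c, d] = adj[d, c] = 0
        adj[a, d] = adj[d, a] = adj[c, b] = adj[b, c] = 1
        edges[i], edges[j] = (a, d), (c, b)
    return adj

# Ring lattice (2 neighbours per side): highly clustered, all degrees equal
n = 30
real = np.zeros((n, n), dtype=int)
for u in range(n):
    for s in (1, 2):
        real[u, (u + s) % n] = real[(u + s) % n, u] = 1

rng = np.random.default_rng(3)
surrogates = [mean_clustering(rewire(real, 400, rng)) for _ in range(5)]
```

A metric is only retained as a discriminator if the real value stands out against the surrogate distribution, since the swaps preserve degrees exactly.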
MWoA/CONT comparison Between the two populations there are few significant differences. One of these is the variation of the VBC in the left frontal area in β band: while during basal activity it is 10% larger in CONT, during stimulation it is larger in MWoA by the same amount.
No definite pattern of distinction can be found for the EBC.
The Z-score/participation pair of indicators also shows few differences between the two groups, and only at the basal activity level: in θ band only, the CP2, CP1 and CP5 channels have a larger Z-score in MWoA (about 50%) and a larger participation coefficient in CONT (a few percent). Considering this behavior with reference to the CONT characteristics, it can be stated that these channels act as local hubs in MWoA.

Resilience measures
MWA/MWoA comparison A significant difference between the two migraine phenotypes can be seen in the β activity during stimulation with the 20 mm checkerboard, which shows an increase of assortativity in MWoA compared to MWA. Finally, the analysis of resilience clearly shows that the recovery capacity of healthy subjects is higher than that of MWoA and, in turn, of MWA: the networks of the latter, with smaller assortativity and average degree distribution than the others, are potentially less able to restore connections after a breakdown.
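For reference, the degree-assortativity coefficient used in this resilience analysis is simply the Pearson correlation between the degrees at the two ends of every edge; a minimal sketch on a toy graph:

```python
# Degree assortativity of a binary undirected network: correlate the degrees
# found at the two endpoints of each edge (counted in both orientations).
import numpy as np

def assortativity(adj):
    """Degree-assortativity coefficient of a binary undirected adjacency matrix."""
    deg = adj.sum(axis=1)
    i, j = np.nonzero(np.triu(adj, 1))
    ends_a = np.concatenate([deg[i], deg[j]])   # degree at one end of each edge
    ends_b = np.concatenate([deg[j], deg[i]])   # degree at the other end
    return np.corrcoef(ends_a, ends_b)[0, 1]

# A star is maximally disassortative: the hub (degree 5) only meets leaves (degree 1)
star = np.zeros((6, 6), dtype=int)
star[0, 1:] = star[1:, 0] = 1
r_star = assortativity(star)   # close to -1
```

Disassortative networks like the star lose connectivity quickly when the hub fails, which is the intuition behind the resilience ranking above.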

Conclusion
What is perhaps of paramount importance is the fact that the results, submitted to the neurologist's judgment, proved to be in complete accordance with the known features of migraine with and without aura, so that the models and procedures used can be considered a fundamental tool to uncover further aspects of the clinical picture of such pathologies [81, 82].

Evoked potentials by laser painful stimulation
The available EEGs consisted of 61-channel recordings of cortical neuroelectric activity from both patients experiencing migraine (MIGR, 29 patients) and healthy subjects (CONT, 16 subjects). The sampling frequency of the signal was again 256 points per second.
The painful stimulation was delivered by a laser striking the right hand, whose power was regulated, subject by subject, just above the pain threshold and was not changed thereafter throughout the session. Before the stimulation arrived, the subject was verbally given a hint about the power of the laser pulse. This suggestion could be correct (pain threshold, L(0)) or misleading, inducing in the subject the idea that the stimulation would be below the pain threshold (L(−)) or well above it (L(+)).
The recordings therefore consisted of two sections of the same length: one second before the stimulation, recording the response to the verbal suggestion, and one second after the laser strike, recording the response to the painful stimulation. For each subject and each stimulation type, 10 to 15 trials were available.
The statistical analysis was conducted by means of Student's t test (alternating the single tails) for the direct post-stimulation comparisons, with a Bonferroni correction equal to 4, whereas the comparison between pre- and post-stimulation in the two populations was conducted using a two-way unbalanced ANOVA with a Tukey-Kramer post-hoc correction.
Even in this case, given the high number of available channels, it was decided to study the behavior of the two populations for each channel, so as to improve the spatial resolution in the study of the cortical neuroelectric activity.

Wavelet Analysis
An abnormal activity is immediately visible up to the θ band in MIGR patients, who, to some extent, anticipate the reaction to the real painful stimulation. Such a phenomenon, which is not observed at all in healthy subjects, takes place about a second before the arrival of the stimulation and is independent of the type of alert preceding it.
This "anticipation" has a shape similar to the one that will subsequently be produced by the painful stimulation; it therefore represents, according to the neurologist's interpretation, a real response of MIGR to the suggestion of the pain stimulation. Its shape is also independent of the suggested intensity and of the particular electrode considered.
From the point of view of the statistical treatment of the data, it was first necessary to normalize the wavelet images in both populations, as the numerical value of each element was always larger in MIGR than in CONT, leaving no chance for a morphological comparison.
Once the data had been normalized, the analysis showed (figure 5.18) that, during the painful stimulation, in the first three bands the information levels exchanged across couples of channels were uniformly higher (up to 50%) in MIGR compared to CONT. But while the amount of input information was quite uniform over all channels, the output information presented differences in its spatial distribution: in δ band, for example, in all three stimulations, the amount of output information from the channels of the right frontal areas remained similar in the two populations, while in α band the output response differed depending on the stimulation and on the notice preceding it.
With the availability of the EEG prior to stimulation, it was possible to compare the two phases, before and after, by means of a two-way ANOVA test, post-hoc corrected with a Pearson test. This comparison shows that the areas involved in the processing of the stimulation always remain the same: what changes is simply their extension; in the lower bands (δ and θ) these distinguishing areas shrink by about 10%, while in the higher bands they stretch by the same amount. The phenomenon already seen for the two pairs of channels, CP5-P3 and CP4-P2, is still present, although less evident than in the stimulated case, which of course amplifies it.

Functional Analysis: Synchronization Entropy
The analysis of phase synchronization, conducted by imposing the two weights (m, n) = (1, 1), shows a differentiation on the basis of frequency bands.
In general, in the first three bands (0.5 to 12 Hz), the phase preceding the stimulation shows a higher synchronization level in CONT compared to MIGR (about 30%), with the exception of the two areas around the couples CP5-P3 and CP4-P2, which distinguish the two populations much less; in particular, when the stimulation L(−) is delivered most of these areas are activated, whereas when the suggestion is correct (L(0)) or of type L(+) the synchronization in those same areas is smaller, but still larger in MIGR.
After stimulation, MIGR are uniformly more synchronized than CONT, showing a strong synchronization in the areas around the usual channels CP4-P2 and CP5-P3: in this case each channel in these areas synchronizes on average with about 60-70% of the other channels.
This strong variation of the post-stimulation phase compared to the pre-stimulation one is much more evident when the stimulation is preceded by misleading advice; when the stimulation is preceded by a correct notice, the cortex synchronizes virtually the same areas that were prepared to receive the painful impulse in the pre-stimulation phase, simply increasing the synchronization probability. Many of these features will return when the response of the information network to stimulation is discussed.
The β band, however, shows a milder difference between pre- and post-stimulation.

Brain Network Analysis
The choice of the threshold to be applied to the connection matrices was made by considering the behavior of different quantities as a function of the threshold itself. Those that did not show a monotonically ascending or descending trend (such as degree, clustering, efficiency and so on) showed a maximum at the adaptive threshold of 60% (figure 5.22): this means that any element of the activation matrix below this value is set to zero, and the link between the corresponding nodes is removed from the network.
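One way to realize such an adaptive threshold (under the interpretation that, for each subject, only the weights above the 60th percentile of that subject's own connection matrix are kept) is:

```python
# Adaptive thresholding of a connectivity matrix: the cut-off is computed
# per matrix, from the distribution of that matrix's own off-diagonal weights.
import numpy as np

def adaptive_threshold(conn, percentile=60.0):
    """Zero out weights below the given percentile of the off-diagonal entries."""
    conn = conn.copy()
    off_diag = conn[~np.eye(conn.shape[0], dtype=bool)]
    cut = np.percentile(off_diag, percentile)
    conn[conn < cut] = 0.0
    return conn

rng = np.random.default_rng(4)
w = rng.random((8, 8))
w = (w + w.T) / 2                 # symmetric toy connectivity matrix
np.fill_diagonal(w, 0.0)

adj = adaptive_threshold(w) > 0   # binary network: ~40% of links survive
```

Because the cut-off is relative to each subject's own weight distribution, subjects with globally stronger or weaker connectivity end up with comparably dense networks.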
Integration measures The analysis of the Characteristic Path Length (CPL) shows an average connection length about 30% larger in MIGR than in CONT (figure 5.23), and likewise for radius and diameter, pointing out that information, in order to be delivered from one node to another, has to follow longer paths, making MIGR clearly less integrated than CONT.
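The CPL itself can be sketched as the mean shortest-path length over all connected node pairs, e.g. via breadth-first search on a toy binary network:

```python
# Characteristic path length: average BFS shortest-path length over all
# ordered pairs of connected nodes in a binary undirected network.
from collections import deque
import numpy as np

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected node pairs."""
    n = adj.shape[0]
    lengths = []
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(adj[u])[0]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths += [d for node, d in dist.items() if node != src]
    return sum(lengths) / len(lengths)

# 4-node ring: each node is 1 step from two neighbours, 2 from the opposite one
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]])
cpl = characteristic_path_length(ring)   # (1 + 2 + 1) / 3 per node = 4/3
```

Radius and diameter follow from the same BFS distances as the minimum and maximum node eccentricities.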
Moreover, the analysis of eccentricity (a measure of the isotropy in information spread) shows variations of about 5% as the suggestion announced a more intense stimulation. In any case, the extension of these areas decreases with the frequency bands, and the right hemisphere is much more involved than the left. The simultaneous analysis of local efficiency shows that, especially in the uppermost band and in those same areas, MIGR are about 10% more efficient than CONT, while the latter are 15% more efficient in the areas across the two hemispheres, mainly during the L(+) stimulation.
The combined effect of these analyses seems to suggest that MIGR, during stimulation, specialize areas to the right and left of the longitudinal midline of the cortex, but without showing the same efficiency as CONT across that line, resulting in a reduced ability of intercommunication between the two hemispheres.
Resilience measures The analysis of the assortativity coefficient and of the network degree distribution indicates, especially during the L(0) and L(+) stimulations, that CONT show an increased ability to restructure the pattern followed by information (figure 5.29). Even more interesting is the fact that this happens regardless of the verbal stimulation preceding the painful strike: it is therefore a factor independent of the particular response of healthy subjects.

Conclusion
The analysis of the double stimulation sequence (verbal + painful) showed many interesting differences between the two populations' behavior. First of all, it showed that MIGR anticipate the activity in θ band before the arrival of the painful stimulation over most of the cortex, regardless of the warning about its intensity (held constant).
The analysis of effective connectivity showed that the amount of information exchanged between the various cortical areas differs in input and output and depends on the type of warning preceding the painful stimulation: in the first three frequency bands, the TE level is significantly higher in MIGR, but while the distinction in input is almost uniform over the entire scalp, the outgoing TE levels differ depending on the stimulation and on the bands; in any case, the distinguishing areas seem to expand as the notice announced a more intense stimulation.
In β band, however, one channel (P3) shows, in contrast with the generally higher TE levels in MIGR compared to CONT, a similar behavior for the incoming information in both populations. In output, however, the distinction between the two populations covers a wide area across the two hemispheres, whose extent increases as the migraine patient is told to expect a more painful stimulation.
The comparison between pre- and post-stimulation has also highlighted that the areas which the verbal stimulation has somehow "prepared" are then actually used in the processing of the painful stimulus itself. The only difference between the two phases lies in the extension of these areas, which in the transition between the phases tend either to shrink (δ and θ bands) or to expand (α and β).
In this context, the P3, CP5, CP4 and P2 channels form a "pincer"-shaped structure that runs from the parietal-occipital area up to the central one, and mark a significant difference in the functional connectivity of the two groups; such a structure is well known in the neurological literature.

Conclusions
At the end of this work, it can be affirmed that the various indicators of brain activity and connectivity are clearly a useful tool for investigating cerebral dynamics, as their results have always been confirmed as correct on the basis of current neurological knowledge.
Moreover, they have stressed the fact that, in addition to the already known distinguishing features of the different migraine phenotypes, they are also capable of pointing out new aspects of connectivity and specialization of the cortical lobes and, when possible, of smaller scalp areas, that is, more restricted neural clusters.
The improvement of computing performance and of the spatial resolution of EEG recordings will provide new tools for connectivity investigation (such as information storage, the still-under-evaluation evolution of transfer entropy) and more powerful brain-networking features, such as Rentian scaling, so as to give a more complete picture of brain dynamics.