|
|
Gene expression in developing barley seeds, from day 0 to day 26 after fertilization (daf), is under
investigation.
Macroarray technology is used to measure gene expression intensities. The images above show laser scans of radioactive
spots on hybridization membranes. Further processing, such as described in
another project of the group, yields two basic data views:
-
Inter-experiment view: how are the individual experiments, each containing several thousand gene expression
values, related to each other? Visual grouping of the experiments helps to identify driving forces ('axes') of
interest, such as developmental stage, tissue, or stress.
-
Inter-gene view: finding functional clusters of gene profiles, such as co-expressed genes or genes with temporally
delayed expression dynamics. Here the correlation of gene expression profiles is of interest.
Tools developed in this project in order to facilitate analysis comprise two types of models:
-
Supervised data models make use of labeled data, such as the classes of early vs. late developmental stage, which are
attached to the expression data. These labels can be used to construct classifiers that predict the class of
unknown data. Common methods are C4.5 classification trees and k-nearest neighbors. Here,
supervised relevance neural gas (SRNG) is used.
-
Unsupervised data models are used to cluster high-dimensional data or to reduce the data dimensionality. Principal
component analysis (PCA) or self-organizing maps (SOM) are standard tools for visualization and clustering respectively.
Here, an efficient variant of multidimensional scaling for high throughput application (HiT-MDS) is presented.
The methods provided on this project page use soft-computing paradigms (artificial neural networks).
|
|
Supervised (Relevance) Neural Gas with General Metrics, S(R)NG-GM
|
|
and Generalized Relevance Learning Vector Quantization (GRLVQ)
|
|
Supervised Neural Gas (SNG) is a well-established prototype-based classifier that combines learning vector quantization (LVQ)
with neural gas (NG). It performs gradient descent on a misclassification cost function into which a custom similarity
measure can be integrated. See the draft article for a detailed description of generalized relevance LVQ (GRLVQ), which is
a specific realization of SNG.
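The gradient descent mentioned above can be illustrated with a minimal sketch. It is not taken from the snggm package. It assumes the commonly published GRLVQ cost term mu = (d_plus - d_minus) / (d_plus + d_minus) with a relevance-weighted squared Euclidean distance, an identity transfer function, one prototype per class, and no neural gas neighborhood cooperation. All function names, learning rates, and the toy data are hypothetical.

```python
import random


def weighted_sq_dist(x, w, lam):
    """Relevance-weighted squared Euclidean distance."""
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, x, w))


def grlvq_step(x, label, protos, proto_labels, lam, eta_w=0.05, eta_l=0.01):
    """One stochastic gradient step on mu = (d_plus - d_minus) / (d_plus + d_minus)."""
    dists = [weighted_sq_dist(x, w, lam) for w in protos]
    same = [i for i, c in enumerate(proto_labels) if c == label]
    other = [i for i, c in enumerate(proto_labels) if c != label]
    plus = min(same, key=lambda i: dists[i])    # closest correct prototype
    minus = min(other, key=lambda i: dists[i])  # closest wrong prototype
    d_plus, d_minus = dists[plus], dists[minus]
    denom = (d_plus + d_minus) ** 2
    if denom == 0.0:
        return
    xi_plus = 2.0 * d_minus / denom   # d mu / d d_plus
    xi_minus = 2.0 * d_plus / denom   # -(d mu / d d_minus)
    w_plus, w_minus = protos[plus], protos[minus]
    # Relevance gradient, evaluated before the prototypes are moved.
    grad_lam = [xi_plus * (a - p) ** 2 - xi_minus * (a - m) ** 2
                for a, p, m in zip(x, w_plus, w_minus)]
    for k, a in enumerate(x):
        w_plus[k] += eta_w * xi_plus * 2.0 * lam[k] * (a - w_plus[k])     # attract
        w_minus[k] -= eta_w * xi_minus * 2.0 * lam[k] * (a - w_minus[k])  # repel
    new_lam = [max(l - eta_l * g, 0.0) for l, g in zip(lam, grad_lam)]
    total = sum(new_lam)
    if total > 0.0:
        lam[:] = [l / total for l in new_lam]  # keep relevances normalized


def classify(x, protos, proto_labels, lam):
    """Label of the prototype closest to x under the learned metric."""
    best = min(range(len(protos)), key=lambda i: weighted_sq_dist(x, protos[i], lam))
    return proto_labels[best]


def make_toy_data(n_per_class=100, seed=0):
    """Hypothetical data: dimension 0 separates the classes, dimension 1 is noise."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 0.3), rng.uniform(-1.0, 1.0)], label))
    return data


def train(data, epochs=20, seed=1):
    """Train one prototype per class, initialised at the class mean."""
    rng = random.Random(seed)
    labels = sorted({c for _, c in data})
    protos = []
    for c in labels:
        members = [x for x, lab in data if lab == c]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    dim = len(data[0][0])
    lam = [1.0 / dim] * dim
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, c in samples:
            grlvq_step(x, c, protos, labels, lam)
    return protos, labels, lam
```

Calling `train(make_toy_data())` returns the prototypes, their labels, and the relevance vector. On this toy set the relevance of the noise dimension should shrink, which is the effect that makes the relevance factors useful for ranking expression features.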
The available package is distributed under the GNU General Public License.
snggm package: implementation of supervised neural gas with general metrics, source code.
The sources require an ANSI-compatible C compiler (gcc preferred).
Cygwin and Linux environments with GNU utilities are recommended for full functionality.
Download: gzipped tar archive snggm-11-oct-2005.tar.gz
(267 755 bytes) / zip archive snggm-11-oct-2005.zip (272 815 bytes).
See README.txt.
|
|
Visual cDNA-Array Gene Expression Analysis
|
|
On relationships between gene expression experiments and
the quest for clusters of coexpressed genes.
Marc Strickert (Pattern Recognition Group) / April 2005
|
HiT-MDS
|
|
Multidimensional Scaling (MDS) is a powerful dimension reduction technique for embedding high-dimensional data into a
low-dimensional target space. Similarities or distance relationships between the source data are reconstructed in the
target space as faithfully as possible. High-Throughput MDS (HiT-MDS) has been designed for large data sets such as
multi-parallel gene expression probes. Particularly for detecting clusters of coexpressed genes, the combination of
HiT-MDS with correlation-based gene profile comparison proves to be very suitable for visual data inspection.
The preliminary C sources can be downloaded: hit-mds.tar.gz (205794 bytes).
A presentation from the PlantMetaNet Meeting is available: strickert-HitMDS.pdf
(1,469,179 bytes).
The paper M. Strickert, S. Teichmann, N. Sreenivasulu, and U. Seiffert, 'High-Throughput Multi-Dimensional Scaling (HiT-MDS)
for cDNA-Array Expression Data', submitted to the International Conference on
Artificial Neural Networks (ICANN05), can still be obtained:
hit-mds_submitted2icann05.pdf (231402 bytes).
|
cDNA-Array Expression Analysis
|
|
Two studies are presented that show the suitability of HiT-MDS for the analysis of high-throughput expression data. In the
experimental design, 11786 gene expression patterns were analyzed during barley seed development, 0–26 days after
flowering in steps of 2 days, in three distinct tissues: pericarp (p), endosperm (end), and embryo (e). With 2 independent
experiments per configuration, a total of 62 experiments have been made available.
The two major questions of interest are:
-
How are the experiments, representing the tissues at a particular developmental stage, characterized with respect to
their transcriptome similarity of specifically expressed genes?
-
Can inter-experimental clusters of coexpressed genes be identified for which, afterwards, regulatory functions can be
assigned?
To address these key questions, the high-dimensional data must be brought into a lower dimension for visualization
while preserving their original distances as well as possible. HiT-MDS is applied to obtain these visualizations, first for
grouping experiments with similar transcriptomes, and second for identifying hot spots of coexpressed genes in the expression data.
|
1. Experiment Clustering
|
|
HiT-MDS is used to embed the 62 available gene expression experiments obtained from cDNA-Arrays with 11786 probed genes into
2D-plots.
|

Figure 1: Standard PCA projection of 62 expression experiments.
|
|
A standard PCA computation is shown in the first figure: a plot of the experiments projected onto the first two eigenvectors.
The squared correlation of the displayed inter-point distances with the distances in the original 11786-dimensional gene space
is r² = 0.9305. A good grouping of the experiments (e1, e2), the tissues (e, p, end), and the temporal order (numbers)
becomes visible.
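The quality figure r² used here and in the following sections can be computed as sketched below. The sketch assumes that it is the squared Pearson correlation between all pairwise Euclidean distances in the original space and in the plot. The function names are illustrative and are not taken from the HiT-MDS sources.

```python
import math


def pairwise_distances(points):
    """Euclidean distances for all pairs i < j, as a flat list."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]


def squared_correlation(a, b):
    """Squared Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


def embedding_quality(original, embedded):
    """r^2 between original and embedded inter-point distances."""
    return squared_correlation(pairwise_distances(original),
                               pairwise_distances(embedded))
```

Because the Pearson correlation ignores a common scale factor, an embedding that reproduces all distances up to a constant factor still reaches r² = 1. Only the relative distance structure is scored.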
|

Figure 2: Euclidean-based HiT-MDS embedding of 62 expression experiments.
|
|
Higher confidence can be put in the second visualization, for which the squared correlation between the displayed point distances
and the original distances is r² = 0.9430. The embedding method spreads the data points more faithfully, without being
constrained by a linear mapping function: any point can be moved freely in order to maximize the inter-distance correlations.
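The idea of moving points freely to maximize the distance correlation can be sketched as plain batch gradient ascent on the Pearson correlation r between the original and the embedded distances. This is an illustration under simplifying assumptions and not the HiT-MDS update rule from the distributed C sources. The step-size control (accept a step only if r improves, otherwise halve it) and all names are hypothetical.

```python
import math
import random


def pair_distances(points):
    """Euclidean distances for all index pairs i < j."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}


def correlation_and_gradient(target, points):
    """Pearson r between target and embedded distances, plus dr/dx per point.

    Assumes neither distance set is constant (otherwise r is undefined).
    """
    pairs = sorted(target)
    dist = {p: math.dist(points[p[0]], points[p[1]]) for p in pairs}
    m = len(pairs)
    mean_t = sum(target[p] for p in pairs) / m
    mean_d = sum(dist[p] for p in pairs) / m
    b = sum((target[p] - mean_t) * (dist[p] - mean_d) for p in pairs)
    c = sum((target[p] - mean_t) ** 2 for p in pairs)
    e = sum((dist[p] - mean_d) ** 2 for p in pairs)
    norm = math.sqrt(c * e)
    r = b / norm
    dim = len(points[0])
    grad = [[0.0] * dim for _ in points]
    for (i, j) in pairs:
        d = dist[(i, j)]
        if d == 0.0:
            continue  # direction undefined for coinciding points
        # Derivative of r with respect to this embedded distance.
        dr_dd = ((target[(i, j)] - mean_t) - (b / e) * (d - mean_d)) / norm
        for k in range(dim):
            g = dr_dd * (points[i][k] - points[j][k]) / d
            grad[i][k] += g
            grad[j][k] -= g
    return r, grad


def embed(data, dim=2, steps=200, seed=0):
    """Gradient ascent on r from a random start; returns points and r history."""
    rng = random.Random(seed)
    target = pair_distances(data)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in data]
    r, grad = correlation_and_gradient(target, points)
    history = [r]
    step = 1.0
    for _ in range(steps):
        trial = [[x + step * g for x, g in zip(p, gp)]
                 for p, gp in zip(points, grad)]
        r_new, grad_new = correlation_and_gradient(target, trial)
        if r_new > r:      # accept only improving moves
            points, r, grad = trial, r_new, grad_new
            step *= 1.2
        else:              # otherwise retry with a smaller step
            step *= 0.5
        history.append(r)
    return points, history
```

Unlike PCA, nothing here ties the output coordinates to a linear function of the input: each embedded point has its own free coordinates. The price is that the result depends on the random start and may be a local optimum.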
|

Figure 3: Correlation-based HiT-MDS embedding of 62 expression experiments.
|
|
Particularly interesting results can be obtained if the experiments are compared according to the (1 - correlation) measure.
The log-transformed cDNA gray-level images are then compared in a way that is invariant to scaling (histogram stretches) and
shifting (light vs. dark). Again, a clear grouping is obtained. Long-term saturation (>= 16 daf) is well emphasized in clusters,
and the intermediate transient developmental stages (6, 8, 10, 12 daf) are given space. The inter-point distance correlation is
r² = 0.935, which seems low but results from the fact that the (1 - correlation) measure between experiments is converted into
plottable Euclidean distances.
Since the minimum dissimilarity is 0 (perfect correlation) and the maximum dissimilarity is 2 (not infinity as in Euclidean terms),
potentially circular shapes will result from such a space conversion when many anticorrelated features are present.
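The invariance and the value range of the (1 - correlation) measure can be checked with a few lines. This sketch assumes the Pearson correlation is meant. The function name and the sample profile are made up for illustration.

```python
import math


def correlation_dissimilarity(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated, 2 for anticorrelated."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


profile = [1.0, 2.0, 4.0, 3.0]
stretched = [2.0 * v + 5.0 for v in profile]  # scaled and shifted copy
inverted = [-v for v in profile]              # anticorrelated copy
```

Here `correlation_dissimilarity(profile, stretched)` is zero up to rounding, and `correlation_dissimilarity(profile, inverted)` reaches the maximum of 2. Constant profiles have no defined correlation and would have to be filtered out beforehand.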
|
2. Gene Clustering
|
|
Gene clustering is a much more demanding task because up to 11786 data points, i.e. the genes with their 31 dimensions of
experimental expression levels, are processed. As usual in expression analysis, the expression intensity levels are subjected
to a standard logarithmic transform.
|

Figure 4: HiT-MDS embedding of 11786 genes from the first series of experiments. Density plot.
|
|
If all genes from one series of experiments are embedded with Euclidean distances, donut-shaped regions of high density are
obtained, where more details are to be found. The squared correlation of the visualized reconstruction with the original data
matrix is r² = 0.958.
|

Figure 5: Gene densities obtained by PCA projection.
|
|
At first glance, the PCA projection looks similar. However, the central region with its many data points is distorted from
a donut shape into a two-dot structure. The squared correlation is only r² = 0.878; HiT-MDS thus yields clearly more
faithful representations, i.e. the donut shape, with its 31-dimensional equivalent, can be trusted more.
|

Figure 6: Correlation-based HiT-MDS embedding on a subset of differentially expressed genes.
The parabola is used to separate the genes into two clusters that can be further analyzed.
|
|
To see how more details are revealed when large groups of anti-correlated genes are separated and processed further
(“zoomed in”), a plot of individual genes is used to find groups of interest. If the high-density cluster of green
points is omitted, the relaxed embedding constraints yield a finer notion of anti-correlation within the group of genes
related to the red points.
|

Figure 7: Correlation-based HiT-MDS embedding of the red marked genes from the prior image.
|
|
After re-embedding these points, many more details and clusters of coexpressed genes become apparent. The spatial differentiation
is very good: clusters of high density denote similar gene profiles.
|

Figure 8: Gene profiles related to the dark red cluster of Fig.7.
|

Figure 9: Gene profiles related to lower left cluster of Fig.7.
|
Conclusions
|
|
HiT-MDS is a useful tool for obtaining good visual representations of multi-parallel cDNA-array data. In particular,
space conversions from correlatedness into Euclidean proximity help to identify clusters of coexpressed genes and to become more
independent of the data shifts and scalings that often occur in measuring practice. HiT-MDS is therefore applicable well
beyond the gene expression analysis presented here.
|
Sanger-Driven MDSLocalize
|
|
MDSLocalize is an SVD-based reconstruction method for converting a given squared NxN distance matrix into a set of d-dimensional
points. See the original paper by Drineas
et al.
Usually, d = 1, 2, or 3 output dimensions are chosen for visualization. The technique made available here replaces
SVD by PCA, which is efficiently approximated by its neural formulation based on Sanger's rule. As pointed out in the draft,
Note that the provided implementation does not yet allow incomplete distance information.
-
PDF draft of the paper submitted to the ESANN conference, before
review changes (670,616 bytes).
-
ZIP archive of MDSLocalize C code (9,688 bytes).
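Sanger's rule itself (the generalized Hebbian algorithm) can be sketched in a few lines. The example below only shows how the rule extracts leading principal directions from sample vectors. How MDSLocalize couples it to the distance matrix is described in the draft and is not reproduced here. Learning rate, epoch count, and function name are assumptions chosen for the synthetic data in the usage note.

```python
import random


def sanger_pca(samples, n_components=2, eta=0.002, epochs=20, seed=0):
    """Estimate leading principal directions with Sanger's rule.

    Returns a list of n_components weight vectors (rows of W).
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    mean = [sum(col) / len(samples) for col in zip(*samples)]
    centred = [[a - m for a, m in zip(x, mean)] for x in samples]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
         for _ in range(n_components)]
    for _ in range(epochs):
        rng.shuffle(centred)
        for x in centred:
            # Outputs of the linear units for this sample.
            y = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in W]
            residual = list(x)
            updates = []
            for i, w in enumerate(W):
                # Remove what components 0..i already reconstruct (deflation).
                residual = [r - y[i] * w_k for r, w_k in zip(residual, w)]
                updates.append([eta * y[i] * r for r in residual])
            # Apply all updates together so each uses the old weights.
            for w, dw in zip(W, updates):
                for k in range(dim):
                    w[k] += dw[k]
    return W
```

For example, with samples drawn as `[rng.gauss(0, 2), rng.gauss(0, 1), rng.gauss(0, 0.2)]`, the first returned vector should align with the first coordinate axis and the second with the second axis, each up to sign. The learning rate must be small relative to the data variance, otherwise the iteration can diverge.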
|
Related links
|
|
Multidimensional scaling:
Author’s homepage:
Bookmarks on this page should be set to http://hitmds.webhop.net/
|
|
|
|
|
Pattern Recognition Group
|
IPK Gatersleben
|
|