The first lecture, by Dr. Fridsma, was on ontologies, information models, and the Semantic Web. The lecture included a description of Tim Berners-Lee, who, along with others, invented the World Wide Web. Berners-Lee's vision for the web included an advanced Semantic Web, but that full vision is probably too hard to realize now. Instead, the web currently relies heavily on syntactic representations of language. One method that can help add semantics to the web is the use of controlled vocabularies; however, the enumerations in controlled vocabularies can become highly confusing as the vocabularies grow large. Another way to add semantics to the web is the Resource Description Framework (RDF). The RDF data model expresses statements as subject-predicate-object triples (sketched below), which add semantic meaning to web pages. The Web Ontology Language (OWL) was created to go beyond RDF in depth of semantic meaning. OWL is available as the full union of OWL syntax and RDF (OWL Full), as a version restricted to a decidable description-logic fragment of first-order logic (OWL DL), or as an "easier to implement" subset of OWL DL (OWL Lite). I am impressed by the advanced forms of semantic technology that have been designed. I can imagine a future where programs on the web can provide useful responses to questions involving complex semantics. It would be nice to get a computer to effectively interpret a question like "What recent popular books contain science fiction without any metaphysics but with some adventure and current science?" I can find an answer to such a question now, but an advanced semantic system on the web could make my search for that answer much more efficient.
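To make the subject-predicate-object idea concrete, here is a minimal sketch in Python of RDF-style triples with a naive pattern-matching query. The book data and the "ex:" names are invented for illustration; real RDF systems use full IRIs and dedicated libraries rather than plain tuples.

# RDF-style statements as (subject, predicate, object) triples.
# All of the data here is made up for illustration.
triples = {
    ("ex:Dune", "ex:genre", "science fiction"),
    ("ex:Dune", "ex:hasTheme", "adventure"),
    ("ex:Contact", "ex:genre", "science fiction"),
    ("ex:Contact", "ex:hasTheme", "current science"),
}

def match(pattern):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Which resources have the genre "science fiction"?
for subject, _, _ in match((None, "ex:genre", "science fiction")):
    print(subject)

A question like my book question above would amount to combining several of these triple patterns, which is essentially what the SPARQL query language does over RDF data.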
The second lecture was on machine learning, with Shuiwang Ji. Supervised learning was one type of machine learning that was taught, and k-nearest-neighbor (KNN) is a kind of supervised learning. KNN learns by comparing attributes to labels in a training set, and it can use what it has learned from those comparisons to predict labels for new sets of attributes. Similarity measures between attributes are used to make those predictions. If KNN is being used to analyze attributes with continuous values, then Euclidean distance can be used to compare the similarity of attributes; other problem-specific measures can be used for non-continuous attributes. The production of a similarity measure by scaling differently sized attribute values seems interesting to me. In the lecture, three large attribute values and one small attribute value seemed to produce a similarity measure that was closest in size to the large values. I'm curious what mathematical technique was used to scale the attribute values to create the similarity measurement; one common possibility, min-max scaling, is sketched below. I am also interested in knowing how large past training sets have been that produced accurate predictions. Perhaps how large a training set should be to be useful for prediction is related to the number of attributes that are compared.
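Here is a minimal KNN sketch in Python, assuming Euclidean distance on continuous attributes. The training data is made up, and min-max scaling is shown only as one common normalization technique; the lecture may have used something else.

import math
from collections import Counter

# Made-up training data: (attributes, label). The second attribute is much
# larger than the first, so unscaled distances would be dominated by it.
train = [([5.1, 350.0], "A"), ([4.9, 300.0], "A"),
         ([6.7, 310.0], "B"), ([6.3, 290.0], "B")]

def minmax_scale(points):
    """Rescale each attribute to [0, 1] so no single attribute dominates."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(p, lows, highs)] for p in points]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, k=3):
    # Scale the training points and the query together, then vote among
    # the labels of the k nearest scaled neighbors.
    scaled = minmax_scale([p for p, _ in train] + [query])
    neighbors = sorted(zip(scaled[:-1], [label for _, label in train]),
                       key=lambda pair: euclidean(scaled[-1], pair[0]))
    return Counter(label for _, label in neighbors[:k]).most_common(1)[0][0]

print(knn_predict([6.0, 300.0]))  # prints "B" with this toy data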
Unsupervised learning, regression, and semi-supervised learning were other forms of machine learning that were taught. Two kinds of unsupervised learning methods are flat and hierarchical clustering. Flat clustering includes k-means, spectral, and graph-based clustering. K-means clustering uses centroids to designate clusters and iteratively reassigns points to clusters; the centroids can be repositioned for improved clustering in each iteration of the process (sketched below). Hierarchical clustering includes agglomerative and divisive clustering, which differ in how they start: agglomerative clustering starts with each point as an individual cluster, while divisive clustering starts with one all-inclusive cluster of points. Machine learning with regression can identify linear or nonlinear forms of dependency among variables with continuous values; regression has been used in both statistics and neural-network fields. Semi-supervised learning can use a mixture of labeled and unlabeled data, and it has been used for gene/protein function classification as well as other applications. Additionally, the validity of clustering techniques has been tested with a variety of measures.
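Here is a minimal k-means sketch in Python to illustrate the iterative reassign-and-reposition process described above. The one-dimensional points and starting centroids are invented for illustration.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: reposition each centroid at the mean of its cluster.
        moved = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
        if moved == centroids:  # no centroid moved, so the clustering is stable
            break
        centroids = moved
    return centroids, clusters

print(kmeans([1.0, 1.5, 2.0, 9.0, 9.5, 10.0], [0.0, 5.0]))
# -> ([1.5, 9.5], [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]])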
Putting semi-supervised learning in the context of genetics research, specifically gene/protein function prediction, was interesting to me. I can imagine how attributes of genetic mutations could be clustered into groups of protein functions. It would be interesting to predict which gene mutations would lead to different protein functions. The connections between protein functions and genes could be further understood by examining which changes in gene sequences lead to different protein functions. For example, if a genetic mutation caused a large difference in protein function, then that mutation could be crucial to the protein's function. I found an article that explored some genetics research with a semi-supervised learning method. Here is a link to the article on predicting transcription factor-gene interactions in Escherichia coli, followed by a toy sketch of one simple semi-supervised scheme:
http://www.pubmedcentral.nih.gov.ezproxy1.lib.asu.edu/articlerender.fcgi?tool=pubmed&pubmedid=18369434
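To make the labeled-plus-unlabeled idea concrete, here is a toy self-training sketch in Python. This is one simple semi-supervised scheme, not necessarily the method used in the article above; the one-dimensional data and function labels are invented, and "confidence" here is just distance to the nearest labeled point.

# Made-up data: a few points with known labels and some unlabeled points.
labeled = [(1.0, "functionA"), (1.2, "functionA"), (8.0, "functionB")]
unlabeled = [1.1, 7.8, 8.2, 4.5]

while unlabeled:
    # Pick the unlabeled point closest to any labeled point (most "confident").
    best = min(unlabeled,
               key=lambda u: min(abs(u - x) for x, _ in labeled))
    # Adopt the label of its nearest labeled neighbor and treat it as labeled,
    # so it can help label the remaining points in later passes.
    _, label = min(labeled, key=lambda pair: abs(best - pair[0]))
    labeled.append((best, label))
    unlabeled.remove(best)

print(labeled)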
Posted by: Nate