yatel.cluster package

Submodules

yatel.cluster.kmeans module

The Yatel kmeans algorithm clusters a network’s environments, using as dimensions either the haplotypes that exist in each environment or arbitrary values computed over them.

For more information about kmeans:

yatel.cluster.kmeans.hap_in_env_coords(nw, env)

Generates the coordinates for the kmeans algorithm from the presence or absence of each haplotype in the environment.

Parameters:

nw : yatel.db.YatelNetwork

env : a collection of dict or yatel.dom.Enviroment

Returns:

array : arrays of arrays

The returned coordinates have M elements (M is the number of haplotypes in the network), in the same order as the result of yatel.db.YatelNetwork.haplotypes_ids. Each element takes one of 2 possible values:

  • 0 if the haplotype doesn't exist in the environment.
  • 1 if the haplotype exists in the environment.
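
The presence/absence encoding described above can be sketched in plain Python without Yatel. The function name presence_coords and both of its arguments are hypothetical stand-ins for illustration only; the real function obtains the haplotype ids and the haplotypes of the environment from the network itself.

```python
def presence_coords(haplotype_ids, haplotypes_in_env):
    """Return one 0/1 coordinate per haplotype id, in the given order.

    haplotype_ids: ids of all haplotypes in the (hypothetical) network.
    haplotypes_in_env: ids of the haplotypes found in one environment.
    """
    present = set(haplotypes_in_env)
    return [1 if hap_id in present else 0 for hap_id in haplotype_ids]


# Hypothetical network with haplotypes 1, 2 and 3;
# only haplotype 2 occurs in the environment being encoded.
coords = presence_coords([1, 2, 3], [2])
print(coords)  # [0, 1, 0]
```

Because every environment is encoded against the same ordered list of ids, the resulting vectors all have length M and can be stacked into one observation matrix.
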
yatel.cluster.kmeans.kmeans(nw, envs, k_or_guess, whiten=False, coordc=None, *args, **kwargs)

Performs k-means on a set of all environments defined by fact_attrs of a network.

Parameters:

nw : yatel.db.YatelNetwork

Network source of environments to classify.

envs : iterable of yatel.dom.Enviroment or dicts

Represents all the environments to be clustered.

k_or_guess : int or ndarray

The number of centroids to generate. A code is assigned to each centroid, which is also the row index of the centroid in the code_book matrix generated.

The initial k centroids are chosen by randomly selecting observations from the observation matrix. Alternatively, passing a k by N array specifies the initial k centroids.

whiten : bool

If True, runs the scipy.cluster.vq.whiten function over the observation array before running the underlying scipy kmeans.

coordc : None or callable

If coordc is None, the hap_in_env_coords function is used to generate the coordinates. Otherwise coordc must be a callable that takes 2 arguments:

  • nw: the network source of the environments to classify.
  • env: the environment for which to calculate the coordinates.

It must return an array of coordinates for the given network environment.

args : arguments for scipy kmeans

kwargs : keyword arguments for scipy kmeans

Returns:

codebook : a k by N array of k centroids

A k by N array of k centroids. The i’th centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion.

distortion : the value of the distortion

The distortion between the observations passed and the centroids generated.
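
To make the relationship between codes, codebook and distortion concrete, here is a minimal pure-Python sketch. It assumes the definition given in the scipy documentation, where the distortion is the mean (non-squared) Euclidean distance between each observation and its nearest centroid. The helper names nearest_code and distortion are hypothetical and this is not the scipy implementation.

```python
import math


def nearest_code(obs, codebook):
    """Return (code, distance) for the centroid closest to one observation."""
    distances = [math.dist(obs, centroid) for centroid in codebook]
    code = min(range(len(codebook)), key=distances.__getitem__)
    return code, distances[code]


def distortion(observations, codebook):
    """Mean Euclidean distance from each observation to its nearest centroid."""
    total = sum(nearest_code(obs, codebook)[1] for obs in observations)
    return total / len(observations)


# Three observations (one row per environment) and a codebook of k=2 centroids.
observations = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
codebook = [[1, 0, 0], [0, 1, 0]]

codes = [nearest_code(obs, codebook)[0] for obs in observations]
print(codes)                               # [0, 1, 1]
print(distortion(observations, codebook))  # 0.0
```

Here every observation coincides with one of the centroids, so the distortion is zero; the code of each observation is simply the row index of its centroid in the codebook.
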

Examples

>>> from yatel import db, dom, stats
>>> from yatel.cluster import kmeans
>>> nw = db.YatelNetwork('memory', mode=db.MODE_WRITE)
>>> nw.add_elements([dom.Haplotype(1), dom.Haplotype(2), dom.Haplotype(3)])
>>> nw.add_elements([dom.Fact(1, att0=True, att1=4),
...                  dom.Fact(2, att0=False),
...                  dom.Fact(2, att0=True, att2="foo")])
>>> nw.add_elements([dom.Edge(12, 1, 2),
...                  dom.Edge(34, 2, 3),
...                  dom.Edge(1.25, 3, 1)])
>>> nw.confirm_changes()
>>> kmeans.kmeans(nw, nw.enviroments(["att0", "att2"]), 2)
(array([[1, 0, 0],
       [0, 1, 0]]),
 0.0,
 (({u'att0': True, u'att2': None},),
  ({u'att0': False, u'att2': None}, {u'att0': True, u'att2': u'foo'})))
>>> calc = lambda nw, env: [stats.average(nw, env), stats.std(nw, env)]
>>> kmeans.kmeans(nw, nw.enviroments(["att0", "att2"]), 2, coordc=calc)
(array([[ 23.   ,  11.   ],
       [  6.625,   5.375]]),
 0.0)
yatel.cluster.kmeans.nw2obs(nw, envs, whiten=False, coordc=None)

Converts the given environments, defined by fact_attrs of a network, into an observation matrix that can be clustered with the underlying scipy kmeans.

Parameters:

nw : yatel.db.YatelNetwork

Network source of environments to classify.

envs : iterable of yatel.dom.Enviroment or dicts

Represents all the environments to be clustered.

whiten : bool

If True, runs the scipy.cluster.vq.whiten function over the observation array before it is returned for use with the underlying scipy kmeans.

coordc : None or callable

If coordc is None, the hap_in_env_coords function is used to generate the coordinates. Otherwise coordc must be a callable that takes 2 arguments:

  • nw: the network source of the environments to classify.
  • env: the environment for which to calculate the coordinates.

It must return an array of coordinates for the given network environment.

Returns:

obs : a vector of envs

Row I of the M by N array is the observation vector of the I’th environment in envs.
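
The conversion described above (one coordinate vector per environment, optionally whitened) can be sketched in pure Python. Everything here is a hypothetical stand-in: the "network" is a plain dict, environments are plain strings rather than dicts or Enviroment objects, and build_observations, whiten_columns and mean_and_spread are illustrative names, not Yatel functions. The whitening step follows the documented behaviour of scipy.cluster.vq.whiten, which divides each column by its standard deviation; leaving constant columns unscaled is an assumption of this sketch.

```python
import statistics


def whiten_columns(obs):
    """Divide each column by its population standard deviation."""
    stds = [statistics.pstdev(column) for column in zip(*obs)]
    # Assumption of this sketch: a constant column (std == 0) is left unscaled.
    scales = [std if std > 0 else 1.0 for std in stds]
    return [[value / scale for value, scale in zip(row, scales)] for row in obs]


def build_observations(network, envs, coordc, whiten=False):
    """Build one row per environment using the coordinates returned by coordc."""
    obs = [list(coordc(network, env)) for env in envs]
    return whiten_columns(obs) if whiten else obs


# Hypothetical data: each environment name maps to some numeric measurements.
network = {"north": [2.0, 4.0], "south": [10.0, 30.0], "east": [6.0, 6.0]}


def mean_and_spread(network, env):
    """Custom coordc: two arbitrary values computed over one environment."""
    values = network[env]
    return [statistics.mean(values), statistics.pstdev(values)]


obs = build_observations(network, ["north", "south", "east"], mean_and_spread)
whitened = build_observations(
    network, ["north", "south", "east"], mean_and_spread, whiten=True
)
```

The callable has the same two-argument shape that coordc expects, (nw, env), so the same idea carries over to a real network; after whitening, every non-constant column has unit standard deviation, which keeps one large-valued coordinate from dominating the distance computation.
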

Examples

>>> from yatel import db, dom
>>> from yatel.cluster import kmeans
>>> nw = db.YatelNetwork('memory', mode=db.MODE_WRITE)
>>> nw.add_elements([dom.Haplotype(1), dom.Haplotype(2), dom.Haplotype(3)])
>>> nw.add_elements([dom.Fact(1, att0=True, att1=4),
...                  dom.Fact(2, att0=False),
...                  dom.Fact(2, att0=True, att2="foo")])
>>> nw.add_elements([dom.Edge(12, 1, 2),
...                  dom.Edge(34, 2, 3),
...                  dom.Edge(1.25, 3, 1)])
>>> nw.confirm_changes()
>>> kmeans.nw2obs(nw, nw.enviroments(["att0", "att2"]))
array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0]])

Module contents

This package contains utilities for clustering environments.