Skip to main content

Abstracts - July 8h, 2010

Go back to the program

  • Mining Context-based Logical Sequences? (Adriana Bechara Prado, LaHC, Université Jean Monnet, Saint-Etienne).

The purpose of this talk is to discuss the implications of mining logical sequences when they are associated with circumstance information, i.e., the context in which the sequences arise.

  • Mining Probabilistic Databases (Toon Calders, Technical University Eindhoven, The Netherlands)

Recently we have seen a number of papers that study mining probabilistic databases. In these probabilistic databases uncertainty is associated with the observations; e.g., with every item in a transactional database a probability is associated. Many of these proposals assume, implicitly or explicitly, that the underlying uncertainty model is an independence model (e.g., [1-7] and many many more) in the sense that the uncertainties associated with the different observations are independent. Complicated extensions to existing algorithms are made in order to make them suitable for itemset mining in uncertain data.

In my presentation I will show that for the independence model there already exist excellent statistical tools to handle these cases. To illustrate my point I will show how the frequentness probability of [2] can be computed using a weak form of the Central Limit Theorem. The bottom line of my presentation will be that complicated exact methods for computing frequentness probabilities are as useful as exact methods to compute the probability of having 352,015 or more heads when we toss a coin 700,000 times.

[1] Charu C. Aggarwal, Yan Li, Jianyong Wang, and Jing Wang. Frequent pattern mining with uncertain data. In Proc. of KDD ’09, pages 29–38, New York, NY, USA, 2009. ACM.
[2] T. Bernecker, HP Kriegel, M Renz, Florian Verhein, Andreas Zuefle. Probabilistic frequent itemset mining in uncertain data. In ACM SIGKDD, 2009
[3] Chun K. Chui and Ben Kao. A decremental approach for mining frequent itemsets from uncertain data. In Washio et al. PaKDD, pages 64–75. 2008
[4] Chun K. Chui, Ben Kao, and Edward Hung. Mining frequent itemsets from uncertain data. In Zhi-Hua Zhou, Hang Li, and Qiang Yang, editors, PAKDD, volume 4426 of Lecture Notes in Computer Science, pages 47–58. Springer,  2007.
[5] CKS Leung, DA Brajczuk. Efficient algorithms for the mining of constrained frequent patterns from uncertain data. In: ACM SIGKDD Explorations Newsletter, 2010
[6] CKS Leung, DA Brajczuk. Mining uncertain data for constrained  frequent sets. In: Proceedings of IDEAS, 2009
[7] Carson Kai-Sang Leung, Mark Anthony F. Mateo, and Dale A. Brajczuk. A tree-based approach for frequent pattern mining from uncertain data. In PaKDD, pages 653–661, 2008.

  • Learning and Predicting the Evolution of Social Networks (Francesco Bonchi, Yahoo! Research Barcelona, Spain)

With the increasing availability of large social networks, there is an increase need for graph mining tools able to extract interesting and actionable knowledge from this wealth of data. In this talk, we will address the following questions: Can we learn a set of rules describing how social networks evolve over time? Can we apply these rules to predict the future evolution of social networks? Given a sequence of snapshots of an evolving network, we aim at learning the rules describing the local changes occurring in the network. Following the paradigm of association rules we define the problem of mining graph-evolution rules that satisfy a minimum support and a minimum confidence constraint. We then apply the extracted rules to the problem of link prediction.

  • DL and Lossy Compression (Arno Siebes, Universiteit Utrecht, The Netherlands).

When presenting our research on MDL for data mining an often recurring question is: but what if you would allow lossy compression? In this talk I will show how we can achieve lossy compression while remaining faithfull to the MDL paradigm (i.e., lossless compression)! Moreover, to honour the occasion and the hosts, I will show how this approach allows us to retrieve as small set of fault-tolerant (or noise tolerant) patterns .

  • Constraint-Based Mining of Closed Patterns in Noisy n-ary Relations (Loïc Cerf, LIRIS, INSA Lyon)

Useful knowledge discovery processes can be based on patterns extracted from large datasets. Designing efficient data mining algorithms to compute collections of relevant patterns is an active research domain. Many datasets record whether some properties hold for some objects, e. g., whether an item is bought by a customer or whether a gene is over-expressed in a biological sample. Such datasets are binary relations and can be represented as 0/1 matrices. In such matrices, a closed itemset is a maximal rectangle of '1's modulo arbitrary permutations of the lines (objects) and the columns (properties). Thus, every closed itemset supports the discovery of a maximal subset of objects sharing the same maximal subset of properties. Efficiently extracting every closed itemset satisfying user-defined relevancy constraints has been extensively studied. Despite its success across many application domains, this framework often turns out to be too narrow. First of all, many datasets are n-ary relations, i. e., 0/1 tensors. Reducing their analysis to two dimensions is ignoring potentially interesting additional dimensions, e. g., where a customer buys an item (localized analysis) or when a gene expression is measured (kinetic analysis). The presence of noise in most datasets is a second issue closed itemset mining does not properly address. Indeed, closed itemsets only cover tuples present in the relation.
  

Generalizing the definition of a closed itemset to make it suit relations of higher arity and tolerate some noise is straightforward (maximal hyper-rectangle with an upper bound of '0's tolerated per hyper-plan). On the contrary, generalizing their extraction is very hard. Indeed, classical algorithms exploit a mathematical property (the Galois connection) of the closed itemsets that none of the two generalizations preserve. That is why our extractor browses the candidate pattern space in an original way that does not favor any dimension. This search can be guided by a very broad class of relevancy constraints the patterns must satisfy. In particular, this thesis studies constraints specifically designed for mining almost-persistent cliques in dynamic graphs. Our extractor is orders of magnitude faster than known competitors focusing on exact patterns in ternary relations or on noise-tolerant patterns in binary relations. Despite these results, such an exhaustive approach often cannot, in a reasonable time, tolerate as much noise as the dataset contains. In this case, complementing the extraction with a hierarchical agglomeration of the (insufficiently noise-tolerant) patterns increases the quality of the returned collection of patterns.