**************************************************************************** SMBio 2008 - 2010 **************************************************************************** Contact: peggy.cellier@info.unicaen.fr The SMBio program (Sequential Mining for Biomedical texts) is a command line tool to discover sequential patterns of gene interaction in biomedical texts. It is the Del n. 8, Task 2.2: "Learning IE rules" of ANR Project BINGO2 (http://bingo2.greyc.fr/). The executable is distributed as a LINUX executable. Authors: Peggy Cellier, Thierry Charnois (GREYC, Caen, France), Marc Plantevit and Christophe Rigotti (LIRIS, Lyon, France) TOC: 1 Introduction 2 Definitions 3 Requirement 4 The options of SMBio 5 Examples of commands 6 Input format 7 Debug mode 8 References **************************************************************************** 1 Introduction **************************************************************************** Details about the process can be found in papers: (in French) M. Plantevit and T. Charnois. Motifs séquentiels pour l'extraction d'information : illustration sur le problème de la détection d'interactions entre gènes. Traitement Automatique des Langues Naturelles (TALN’09), Senlis, France, Juin 2009. (in English) P. Cellier, T. Charnois and M. Plantevit. Sequential Patterns to Discover and Characterise Biological Relations. International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’10), Iasi, Roumanie, March 2010. **************************************************************************** 2 Definitions **************************************************************************** Sequential pattern mining [AgrawalSrikant95] is a data mining technique that aims at discovering correlations between events through their order of appearance. Sequential pattern mining is an important field of data mining with broad applications. The algorithm which is used there to compute frequent sequential pattern is dmt4 [NanniRigotti07]. SEQUENCE/ITEM: ------------------ In the context of the sequential patterns extraction, a sequence is an ordered list of distinct literals called items. A sequence S is denoted by ( i_1 i_2 ... i_n ) where i_k, 1 <= k <= n, is an item. SUBSEQUENCE/SUPERSEQUENCE: ------------------------- Let S_1 = ( i_1 i_2 ... i_n ) and S_2 = ( i'_1 i'_2 ... i'_m ) be two sequences. S_1 is included in S_2 if there exist integers 1 <= j_1 < j_2 < ... < j_n <= m such that i_1 = i'_{j_1}, i_2 = i'_{j_2}, ..., i_n = i'_{j_n}. S_1 is called a subsequence of S_2. S_2 is called a super-sequence of S_1. MAXIMALITY: ----------- An extracted sequential pattern, S_1, is maximal if there is no other extracted sequential pattern, S_2. SEQUENCE DATABASE: ------------------ A sequence database SDB is a set of tuples (sid, S), where sid is a sequence ID and S a sequence. A tuple (sid, S) contains a sequence S1, if S1 is a subsequence of S. SUPPORT: -------- The support of a sequence S1 in a sequence database SDB is the number of tuples in the database containing S1. Note that we do not mention the database when it is clear from the context (sup(S1)). CONSTRAINED-BASED MINING: ------------------------ The constraint-based pattern mining framework is a paradigm to discover new knowledge [NgLakshmananHanPang98]. Constraints provide a focus on the most promising knowledge by reducing the number of extracted patterns to those of potential interest for the user. More precisely, constraint-based mining task selects all the sequential patterns included in SDB and satisfying a predicate which is named constraint. There are a lot of constraints to evaluate the relevance of sequential patterns. The most known example is likely the frequency constraint. Given a mininimum support threshold minsup, the problem of frequent sequential pattern mining is to find the complete set of sequential patterns whose support is greater or equal to minsup. There are many other constraints highlighting the best sequential patterns with respect to the user objectives [PeiHanLakshmanan01]. We use both syntactic constraints and constraints coming from linguistic information. RECURSIVE MINING: ----------------- Recursive pattern mining [Soulet07] is a process that gives prominence to the most significant patterns and filters the specific ones. The key idea of recursive pattern mining is to repeat the pattern mining process on the output in order to reduce it until few and significant patterns are obtained. That recursive process is ended when the result becomes stable. The final recursive patterns bring forward information coming from each mining step. More precisely, recursive pattern produces a k-summary (i.e., a set with at most k patterns) summarizing the data according to a measure (e.g., frequency, growth rate) where k is a given number. One of the advantages of the recursive pattern mining is that the number of returned frequent patterns is well-mastered. **************************************************************************** 3 Requirement **************************************************************************** Java 1.5 (or more recent versions) **************************************************************************** 4 The options of SMBio **************************************************************************** ------------------------------------------------------------ General options -i FILENAME : Input file (a corpus that was previoulsy POS tagged) ------------------------------------------------------------ Options for sequence mining -f MINSUP : threshold for the support (by default 10) ------------------------------------------------------------ Options from the POS Tagging step -verb POS_VERB : tag of verbs (by default @V) -noun POS_NOUN : tag of nouns (by default @N) -pnoun POS_PROPER_NOUN : tag of proper nouns(by default @NP) -key NAMED_ENTITY : tag of the named entity (by default AGENE) ------------------------------------------------------------ Options fo recursive mining -k K : k value for the recursive mining (i.e. maximal number of patterns by verb or noun) ------------------------------------------------------------ Options for debugging -debug BOOLVAL : activates the debug mode when BOOLVAL=true and does not delete temporary folders and files (by default false) **************************************************************************** 5 Examples of commands **************************************************************************** A toy input file is given as examples/corpus.txt on which you can try the following commands: Find IE rules for minsup=5: java smBio.SMBio -i examples/corpus.txt -f 5 Find IE rules for k=4: java smBio.SMBio -i examples/corpus.txt -k 4 Find IE rules in debug mode: java smBio.SMBio -i examples/corpus.txt -debug true Note: Several directories are generated in the folder examples/ (see Section 7). **************************************************************************** 6 Input format **************************************************************************** The corpus have to be previously POS tagged, for example, by treetagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). The item has to be lemma@POStag. The POStag value for nouns and verbs can be parameterized thanks to the options -verb, -noun and -pnoun. There is one sequence by line of the input file. For example, in the following there are 3 sequences: mutation@NNS in@IN the@DT human@JJ AGENE@NP ,@, AGENE@NP and@CC AGENE@NP gene@NNS be@VBP responsible@JJ .@SENT Germline@NP mutation@NNS of@IN the@DT three@CD succinate@NN dehydrogenase@NN subunits@NN AGENE@NP ,@, AGENE@NP and@CC AGENE@NP have@VHP recently@RB be@VBN associate@VVN with@IN familial@JJ pheochromocytoma@NN and@CC paraganglioma@NN .@SENT therefore@RB ,@, we@PP attempt@VVD to@TO determine@VV whether@IN the@DT tumor@NN suppressor@NN gene@NNS AGENE@NP ,@, AGENE@NP and@CC AGENE@NP be@VBP involve@VVN in@IN sporadic@JJ and@CC familial@JJ MTC@NP .@SENT **************************************************************************** 7 Debug mode **************************************************************************** Hierarchy of the examples folder when debug mode is activated: examples/ corpus.txt corpus.txt_tableCorrespondances.txt corpus.txt_CorpusDmt4.txt corpus.txt_all_freq_seq_pat_dmt4_minsup_10 corpus.txt_constraints/ corpus.txt_frequent_max_constrained_patterns/ corpus.txt_recursive/ CorpusDMT4/ RSP/ RSP_MAX/ RSP_S/ TabCorr/ corpus.txt_results/ corpus.txt : input file corpus.txt_CorpusDmt4.txt : corpus.txt in dmt4 format, it means that each item of corpus.txt is replaced by an integer corpus.txt_tableCorrespondances.txt : correspondance between integers and items in natural language corpus.txt_all_freq_seq_pat_dmt4_minsup_10 : output file of dmt4, i.e. all frequent sequential patterns for minsup=10 corpus.txt_constraints/ AGENE_seqPat_W2G_minsup_10.txt : all frequent sequential patterns that contain at least 2 genes etiq_Noun_minsup_10.txt : integers that correspond to a noun in natural language etiq_Verb_minsup_10.txt : integers that correspond to a verb in natural language AGENE_seqPat_2G1N_minsup_10.txt : all frequent sequential patterns that contain at least 2 genes and at least 1 noun AGENE_seqPat_2G1V_minsup_10.txt : all frequent sequential patterns that contain at least 2 genes and at least 1 verb AGENE_seqPat_2G1N_Max_minsup_10.txt : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun AGENE_seqPat_2G1V_Max_minsup_10.txt : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 verb _seqPat_nouns_max_tal_minsup_10.txt : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun in natural language _seqPat_verbs_max_tal_minsup_10.txt : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 verb in natural language _nouns_list.txt : list of nouns that are in _episodes_noms_max_tal_minsup_10.txt _verbs_list.txt : list of verbs that are in _episodes_verbes_max_tal_minsup_10.txt corpus.txt_frequent_max_constrained_patterns/ : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun/verb group by verb and noun corpus.txt_recursive CorpusDMT4/ : for each group the file in dmt4 format TabCorr/ : correspondance between integers and items in natural language RSP/ : frequent patterns RSP_MAX/ : maximal frequent patterns RSP_S/ : maximal frequent patterns that contain at least 2 genes and at least 1 noun/verb corpus.txt_results/ : maximal frequent sequential patterns group by verb and noun s.t. there are no more than k patterns by group. **************************************************************************** 8 References **************************************************************************** [AgrawalSrikant95] Agrawal, R. and Srikant, R. (1995) "Mining sequential patterns", Proc. of the 11th Int. Conf. on Data Engineering (ICDE'95), pp.3­14. [NanniRigotti07] Nanni, M. and Rigotti, C. (2007) "Extracting Trees of Quantitative Serial Episodes", Knowledge Discovery in Inductive Databases 5th Int. Workshop (KDID'06), Revised Selected and Invited Papers, LNCS 4747, pp.170. [NgLakshmananHanPang98] Ng, R.T., Lakshmanan, V.S., Han, J. and Pang, A. (1998) "Exploratory mining and pruning optimizations of constrained associations rules", Proceedings of ACM SIGMOD'98, ACM Press, pp.13­24. [PeiHanLakshmanan01] J. Pei, J. Han, and L. V. S. Lakshmanan "Mining frequent itemsets with convertible constraints. In ICDE, 2001". [Soulet07] Soulet, A. (2007). Résumer les contrastes par l'extraction récursive de motifs. Conférence sur l'Apprentissage Automatique (CAp'07) (pp. 339­354). Grenoble, France. Cépaduès Edition. **************************************************************************** THE END ****************************************************************************