****************************************************************************
SMBio 2008 - 2010
****************************************************************************

Contact: peggy.cellier@info.unicaen.fr

The SMBio program (Sequential Mining for Biomedical texts) is a command line tool to discover sequential patterns of gene interaction in biomedical texts.
It is the Del n. 8, Task 2.2: "Learning IE rules" of ANR Project BINGO2 (http://bingo2.greyc.fr/).
The executable is distributed as a LINUX executable.

Authors: Peggy Cellier, Thierry Charnois (GREYC, Caen, France), Marc Plantevit and Christophe Rigotti (LIRIS, Lyon, France)

TOC:

1 Introduction
2 Definitions
3 Requirement
4 The options of SMBio
5 Examples of commands
6 Input format
7 Debug mode
8 References


****************************************************************************
1 Introduction
****************************************************************************

Details about the process can be found in papers:

(in French)
M. Plantevit and T. Charnois. 
Motifs séquentiels pour l'extraction d'information : illustration sur le problème de la détection d'interactions entre gènes. 
Traitement Automatique des Langues Naturelles (TALN’09), Senlis, France, Juin 2009.

(in English)
P. Cellier, T. Charnois and M. Plantevit. 
Sequential Patterns to Discover and Characterise Biological Relations. 
International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’10), Iasi, Roumanie, March 2010.

****************************************************************************
2 Definitions
****************************************************************************

Sequential pattern mining [AgrawalSrikant95] is a data mining technique 
that aims at discovering correlations between events through their 
order of appearance. Sequential pattern mining is an important 
field of data mining with broad applications.
The algorithm which is used there to compute frequent sequential pattern 
is dmt4 [NanniRigotti07].

SEQUENCE/ITEM:
------------------
In the context of the sequential patterns extraction, a sequence is an 
ordered list of distinct literals called items.  
A sequence S is denoted by ( i_1 i_2 ... i_n )
where i_k, 1 <= k <= n, is an item.

SUBSEQUENCE/SUPERSEQUENCE:
-------------------------
Let S_1 = ( i_1 i_2 ... i_n ) and S_2 = ( i'_1 i'_2 ... i'_m ) be two sequences.  
S_1 is included in S_2 if there exist integers 1 <= j_1 < j_2 < ... < j_n <= m
such that i_1 = i'_{j_1}, i_2 = i'_{j_2}, ..., i_n = i'_{j_n}.
S_1 is called a subsequence of S_2.  
S_2 is called a super-sequence of S_1. 

MAXIMALITY:
-----------
An extracted sequential pattern, S_1, is maximal if there is no other 
extracted sequential pattern, S_2.

SEQUENCE DATABASE:
------------------
A sequence database SDB is a set of tuples (sid, S), where sid
is a sequence ID and S a sequence.  A tuple (sid, S)
contains a sequence S1, if S1 is a
subsequence of S.

SUPPORT:
--------
The support of a sequence S1 in a sequence database
SDB is the number of tuples in the database containing S1.
Note that we do not mention the database when it is clear from the context
(sup(S1)).  

CONSTRAINED-BASED MINING:
------------------------
The constraint-based pattern mining framework is a paradigm
to discover new knowledge [NgLakshmananHanPang98].  Constraints
provide a focus on the most promising knowledge by reducing the number
of extracted patterns to those of potential interest for the
user. More precisely, constraint-based mining task selects all the
sequential patterns included in SDB and satisfying a predicate which
is named constraint. There are a lot of constraints to evaluate
the relevance of sequential patterns. The most known example is likely
the frequency constraint.  Given a mininimum support threshold
minsup, the problem of frequent sequential pattern mining is to find
the complete set of sequential patterns whose support is greater or
equal to minsup. There are many other constraints highlighting the
best sequential patterns with respect to the user 
objectives [PeiHanLakshmanan01]. We use both syntactic constraints 
and constraints coming from linguistic information.

RECURSIVE MINING:
-----------------
Recursive pattern mining [Soulet07] is a process that gives prominence
to the most significant patterns and filters the specific ones.
The key idea of recursive pattern mining is to repeat the pattern mining
process on the output in order to reduce it until few and significant
patterns are obtained. That recursive process is ended when the result
becomes stable.  The final recursive patterns bring forward information
coming from each mining step. More precisely, recursive pattern produces a
k-summary (i.e., a set with at most k patterns) summarizing the data
according to a measure (e.g., frequency, growth rate) where k is a given
number. One of the advantages of the recursive pattern mining is that the
number of returned frequent patterns is well-mastered.

****************************************************************************
3 Requirement
****************************************************************************

Java 1.5 (or more recent versions)

****************************************************************************
4 The options of SMBio
****************************************************************************

------------------------------------------------------------
General options

-i FILENAME : Input file (a corpus that was previoulsy POS tagged)

------------------------------------------------------------
Options for sequence mining 

-f MINSUP : threshold for the support (by default 10)

------------------------------------------------------------
Options from the POS Tagging step

-verb POS_VERB : tag of verbs (by default @V)
-noun POS_NOUN : tag of nouns (by default @N)
-pnoun POS_PROPER_NOUN : tag of proper nouns(by default @NP)
-key NAMED_ENTITY : tag of the named entity (by default AGENE)

------------------------------------------------------------
Options fo recursive mining

-k K : k value for the recursive mining (i.e. maximal number of patterns by verb or noun)

------------------------------------------------------------
Options for debugging

-debug BOOLVAL : activates the debug mode when BOOLVAL=true and does not delete temporary folders and files (by default false)

****************************************************************************
5 Examples of commands
****************************************************************************

A toy input file is given as examples/corpus.txt on which you can try the following commands:

Find IE rules for minsup=5:
java smBio.SMBio -i examples/corpus.txt -f 5

Find IE rules for k=4:
java smBio.SMBio -i examples/corpus.txt -k 4

Find IE rules in debug mode:
java smBio.SMBio -i examples/corpus.txt -debug true
Note: Several directories are generated in the folder examples/ (see Section 7).

****************************************************************************
6 Input format
****************************************************************************

The corpus have to be previously POS tagged, for example, by treetagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/).
The item has to be lemma@POStag.
The POStag value for nouns and verbs can be parameterized thanks to the options -verb, -noun and -pnoun. 
There is one sequence by line of the input file.

For example, in the following there are 3 sequences:

mutation@NNS in@IN the@DT human@JJ  AGENE@NP  ,@,  AGENE@NP  and@CC  AGENE@NP  gene@NNS be@VBP responsible@JJ .@SENT 
Germline@NP mutation@NNS of@IN the@DT three@CD succinate@NN dehydrogenase@NN subunits@NN  AGENE@NP  ,@,  AGENE@NP  and@CC  AGENE@NP  have@VHP recently@RB be@VBN associate@VVN with@IN familial@JJ pheochromocytoma@NN and@CC paraganglioma@NN .@SENT 
therefore@RB ,@, we@PP attempt@VVD to@TO determine@VV whether@IN the@DT tumor@NN suppressor@NN gene@NNS  AGENE@NP  ,@,  AGENE@NP  and@CC  AGENE@NP  be@VBP involve@VVN in@IN sporadic@JJ and@CC familial@JJ MTC@NP .@SENT 

****************************************************************************
7 Debug mode
****************************************************************************

Hierarchy of the examples folder when debug mode is activated:

examples/
	corpus.txt
	corpus.txt_tableCorrespondances.txt
	corpus.txt_CorpusDmt4.txt
	corpus.txt_all_freq_seq_pat_dmt4_minsup_10
	corpus.txt_constraints/
	corpus.txt_frequent_max_constrained_patterns/
	corpus.txt_recursive/ 
		CorpusDMT4/  
		RSP/
		RSP_MAX/
		RSP_S/
		TabCorr/
	corpus.txt_results/


corpus.txt : input file
corpus.txt_CorpusDmt4.txt                     : corpus.txt in dmt4 format, it means that each item of corpus.txt is replaced by an integer
corpus.txt_tableCorrespondances.txt           : correspondance between integers and items in natural language
corpus.txt_all_freq_seq_pat_dmt4_minsup_10    : output file of dmt4, i.e. all frequent sequential patterns for minsup=10
corpus.txt_constraints/
	AGENE_seqPat_W2G_minsup_10.txt        : all frequent sequential patterns that contain at least 2 genes
	etiq_Noun_minsup_10.txt               : integers that correspond to a noun in natural language
	etiq_Verb_minsup_10.txt               : integers that correspond to a verb in natural language
	AGENE_seqPat_2G1N_minsup_10.txt       : all frequent sequential patterns that contain at least 2 genes and at least 1 noun
	AGENE_seqPat_2G1V_minsup_10.txt       : all frequent sequential patterns that contain at least 2 genes and at least 1 verb
	AGENE_seqPat_2G1N_Max_minsup_10.txt   : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun
	AGENE_seqPat_2G1V_Max_minsup_10.txt   : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 verb
	_seqPat_nouns_max_tal_minsup_10.txt   : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun in natural language
	_seqPat_verbs_max_tal_minsup_10.txt   : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 verb in natural language
	_nouns_list.txt                       : list of nouns that are in _episodes_noms_max_tal_minsup_10.txt
	_verbs_list.txt	                      : list of verbs that are in _episodes_verbes_max_tal_minsup_10.txt
corpus.txt_frequent_max_constrained_patterns/ : all maximal frequent sequential patterns that contain at least 2 genes and at least 1 noun/verb group by verb and noun
corpus.txt_recursive
	CorpusDMT4/  : for each group the file in dmt4 format
	TabCorr/     : correspondance between integers and items in natural language
	RSP/         : frequent patterns
	RSP_MAX/     : maximal frequent patterns
	RSP_S/       : maximal frequent patterns that contain at least 2 genes and at least 1 noun/verb
corpus.txt_results/  : maximal frequent sequential patterns group by verb and noun s.t. there are no more than k patterns by group.


****************************************************************************
8 References
****************************************************************************

[AgrawalSrikant95] Agrawal, R. and Srikant, R. (1995) "Mining sequential
patterns", Proc. of the 11th Int. Conf. on Data Engineering (ICDE'95),
pp.3­14.

[NanniRigotti07] Nanni, M. and Rigotti, C. (2007) "Extracting Trees of 
Quantitative Serial Episodes", Knowledge Discovery in Inductive Databases 
5th Int. Workshop (KDID'06), Revised Selected and Invited Papers, 
LNCS 4747, pp.170.

[NgLakshmananHanPang98] Ng, R.T., Lakshmanan, V.S., Han, J. and Pang,
A. (1998) "Exploratory mining and pruning optimizations of constrained
associations rules", Proceedings of ACM SIGMOD'98, ACM Press, pp.13­24.

[PeiHanLakshmanan01] J. Pei, J. Han, and L. V. S. Lakshmanan "Mining
frequent itemsets with convertible constraints. In ICDE, 2001".

[Soulet07] Soulet, A. (2007). Résumer les contrastes par l'extraction
récursive de motifs. Conférence sur l'Apprentissage Automatique (CAp'07)
(pp. 339­354). Grenoble, France. Cépaduès Edition.


****************************************************************************
THE END
****************************************************************************