LSE Research Online LSE Library Services

Quelques stratégies pour l’exploitation de gros corpus en analyse des données textuelles

Lahlou, Saadi and Folch, Helka (1998) Quelques stratégies pour l’exploitation de gros corpus en analyse des données textuelles. In: Mellet, Sylvie, (ed.) Jadt 1998. 4èmes Journées Internationales D'analyse Statistique des Données Textuelles. Université de Nice Sophia Antipolis, Nice, pp. 381-390. ISBN 2864840057


Abstract

Our work on the Scriptorium project confronted us with a range of problems faced by analysts who apply text mining to large heterogeneous corpora (intranet, WWW, document databases). This paper presents solutions concerning the extraction of relevant sub-sections of the corpus, meta-data, efficient storage, and historisation. We introduce two original solutions: document storage based on collections of self-describing texts with meta-data embedded as mark-up (rather than a DBMS or file-based approach, since full-text indexing at such a scale is heavy); and an extractor based on the software product TOPIC that retrieves relevant paragraphs and assembles them into homogeneous sub-corpora of exploitable size (under 10 megabytes). We also describe the strategies we adopted for comparing different analyses of the corpus in a historical perspective, in particular the transformation of ALCESTE class profiles into TOPIC concepts, which aims to provide fixed, quantifiable measurements of the density of certain topics in the texts. This paper was given at the 4èmes Journées Internationales d’Analyse Statistique des Données Textuelles, Nice, France, 18-21 February 1998. It was published in the proceedings volume and is freely available at http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt1998/JADT1998.htm
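
The two ideas in the abstract (self-describing texts carrying their own meta-data as mark-up, and paragraph-level extraction into a size-capped sub-corpus) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use TOPIC: the `<doc>` and `<p>` tag names, the `source` and `date` attributes, the sample texts, and the plain keyword match standing in for a TOPIC concept are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing documents: the meta-data lives in the mark-up
# itself, so no separate database or index is needed to interpret a text.
DOCS = [
    '<doc source="intranet" date="1997-03-01">'
    '<p>Nuclear safety report.</p><p>Canteen menu.</p></doc>',
    '<doc source="www" date="1997-06-15">'
    '<p>Safety rules for reactors.</p></doc>',
]


def extract_subcorpus(docs, keywords, max_bytes=10_000_000):
    """Collect paragraphs that mention any keyword, staying under max_bytes.

    A simple keyword match is used here as a stand-in for a real
    concept-based retrieval engine (assumption, not the paper's method).
    """
    wanted = {k.lower() for k in keywords}
    selected, size = [], 0
    for raw in docs:
        root = ET.fromstring(raw)
        meta = dict(root.attrib)  # meta-data travels with each paragraph
        for p in root.iter("p"):
            text = (p.text or "").strip()
            words = {w.strip(".,;:!?").lower() for w in text.split()}
            if not (words & wanted):
                continue  # paragraph is not relevant to the topic
            n = len(text.encode("utf-8"))
            if size + n > max_bytes:
                return selected  # keep the sub-corpus at an exploitable size
            selected.append({"meta": meta, "text": text})
            size += n
    return selected


subcorpus = extract_subcorpus(DOCS, ["safety"])
```

Each selected paragraph keeps the meta-data of its parent document, so the resulting sub-corpus stays self-describing when it is passed on to a statistical analysis tool.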

Item Type: Book Section
Official URL: http://www.cavi.univ-paris3.fr/lexicometrica/jadt/...
Additional Information: © 1998 Université de Nice Sophia Antipolis
Library of Congress subject classification: B Philosophy. Psychology. Religion > BF Psychology
P Language and Literature > P Philology. Linguistics
Sets: Departments > Social Psychology
Rights: http://www.lse.ac.uk/library/usingTheLibrary/academicSupport/OA/depositYourResearch.aspx
Date Deposited: 07 Mar 2011 14:32
URL: http://eprints.lse.ac.uk/33005/
