A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure

Authors

1 Department of Information Technology, Vardhaman College of Engineering, Hyderabad, Telangana, India

2 Department of Computer Science and Engineering, JNTUH College of Engineering, Jagtiyal, Karimnagar, Telangana, India

3 Department of Computer Science and Engineering, Matrusri Engineering College, Hyderabad, Telangana, India

Abstract

Author profiling is a text classification technique used to predict the profile of the author of an unknown text by analyzing writing style. An author profile comprises characteristics such as gender, age, native language, country, and educational background. Existing approaches to author profiling suffer from high feature dimensionality and fail to capture the relationships between features. This work proposes a new document weighted approach to address these problems. In this approach, a term weight measure assigns a suitable weight to each term, and these term weights are aggregated to compute the document weight. The classification model is built from these document weights to predict the profile of a text. The proposed approach and existing approaches are evaluated on the reviews domain with different classifiers. The accuracies of the proposed approach for gender and age prediction are more promising than those of existing approaches.
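The pipeline outlined in the abstract (term weights, aggregated into document weights, which then feed a classifier) can be illustrated with a minimal Python sketch. The abstract does not state the term weight measure or the aggregation rule, so both are assumptions made here for illustration: a term's weight for a class is taken as the share of its training occurrences that fall in that class, and the document weight is the mean of those weights over the document's tokens. The function names, the toy reviews, and the placeholder classes "A" and "B" are hypothetical. Picking the class with the largest document weight stands in for the trained classifiers used in the paper.

```python
from collections import Counter, defaultdict


def term_weights(docs, labels):
    """Assumed measure: weight of term t for class c is the share of
    t's training occurrences that fall in documents of class c."""
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_class[label].update(tokens)
    totals = Counter()
    for counts in per_class.values():
        totals.update(counts)
    return {
        label: {t: counts[t] / totals[t] for t in counts}
        for label, counts in per_class.items()
    }


def document_weights(tokens, weights):
    """Assumed aggregation: mean term weight per class over the tokens.
    Terms unseen for a class contribute 0."""
    n = max(len(tokens), 1)
    return {
        label: sum(w.get(t, 0.0) for t in tokens) / n
        for label, w in weights.items()
    }


# Toy review corpus with placeholder profile classes (hypothetical data).
train_docs = [
    "loved the cosy room and lovely staff".split(),
    "the room was lovely and the staff so kind".split(),
    "good wifi and parking near the stadium".split(),
    "parking was cheap and the wifi fast".split(),
]
train_labels = ["A", "A", "B", "B"]

weights = term_weights(train_docs, train_labels)

# One low-dimensional feature vector per document: one weight per class.
vector = document_weights("lovely room but slow wifi".split(), weights)

# Stand-in for a trained classifier: choose the class with the largest weight.
predicted = max(vector, key=vector.get)
print(vector, predicted)
```

Under these assumptions each document is represented by one number per profile class rather than one feature per vocabulary term, which is how a document weighted representation sidesteps the high dimensionality mentioned above.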

Keywords

