SelfDocSeg:

A Self-Supervised vision-based Approach towards Document Segmentation

1 Technology Innovation Hub, Indian Statistical Institute, Kolkata, India
2 Computer Vision Center, Computer Science Department, Universitat Autònoma de Barcelona, Spain
3 CVPR Unit, Indian Statistical Institute, Kolkata, India
4 Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, India

A basic methodological distinction between SelfDocSeg and existing approaches. While earlier works utilize information from visual, layout, and textual modalities for large-scale pre-training, we deal with visual cues only for boosting representation learning.

Abstract

Document layout analysis is a well-known problem in the document research community and has been extensively explored, yielding a multitude of solutions ranging from text mining and recognition to graph-based representation and visual feature extraction. However, most existing works have ignored the scarcity of labeled data. With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain, making data annotation a tedious task. We address this challenge using self-supervision. Unlike the few existing self-supervised document segmentation approaches, which use text mining and textual labels, we use a completely vision-based approach in pre-training, without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn document object representation and localization in a self-supervised framework, before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs on par with, if not better than, existing methods and their supervised counterparts.

Architecture

SelfDocSeg pre-trains a CNN image encoder specifically for the document image instance segmentation task. The architecture uses two encoders, one updated by backpropagation and the other by an exponential moving average (EMA), and is trained with a focal loss for document object localization and a cosine similarity loss for document object representation learning.
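The three ingredients named in this caption (an EMA-updated encoder, a cosine similarity loss, and a focal loss) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `momentum`, `gamma`, and `alpha` defaults, the assumption that the EMA encoder tracks the weights of the backpropagated encoder, and the specific binary form of the focal loss are all assumptions made for illustration.

```python
import math

def ema_update(target, online, momentum=0.99):
    """Blend each target weight toward the matching online weight.

    Assumption: the EMA encoder tracks the backpropagated encoder's weights.
    Weights are flat lists of floats here purely for illustration.
    """
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]

def cosine_similarity_loss(p, z, eps=1e-8):
    """Negative cosine similarity between two embedding vectors.

    The loss approaches -1 when the embeddings point in the same direction.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z + eps)

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-8):
    """Mean binary focal loss over per-pixel foreground probabilities.

    Well-classified pixels are down-weighted by (1 - p_t) ** gamma.
    gamma and alpha are illustrative defaults, not values from the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)
    return total / len(probs)
```

In a real training loop these would operate on tensors in a deep learning framework; the scalar versions here only show how each quantity is computed.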

Process for mask generation from document objects using image processing.
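This page does not spell out the individual image-processing steps, so the following is only a hypothetical sketch of how a pseudo-layout could be derived from a grayscale page: binarize the image, dilate the ink so nearby marks merge, and take the bounding box of each connected blob. The function names, the `threshold` and `radius` parameters, and the choice of these three steps are assumptions, not the authors' pipeline.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Mark dark (ink) pixels as 1 and light background pixels as 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def dilate(mask, radius=1):
    """Grow foreground pixels by `radius` so nearby ink merges into one blob."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def component_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                x0, y0, x1, y1 = x, y, x, y
                while queue:  # breadth-first flood fill of one blob
                    cy, cx = queue.popleft()
                    x0, y0 = min(x0, cx), min(y0, cy)
                    x1, y1 = max(x1, cx), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

def pseudo_layout(gray, threshold=128, radius=1):
    """Hypothetical end-to-end helper: grayscale page -> list of blob boxes."""
    return component_boxes(dilate(binarize(gray, threshold), radius))
```

A production version would more likely rely on an image-processing library for thresholding and morphology; the pure-Python form is used here only to keep the sketch self-contained.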

 

Results

Samples of document segmentation using Mask R-CNN with a ResNet-50 backbone pre-trained using SelfDocSeg. The left side shows the prediction, while the right side shows the ground truth.

BibTeX

@inproceedings{maity2023selfdocseg,
title={SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation},
author={Subhajit Maity and Sanket Biswas and Siladittya Manna and Ayan Banerjee and Josep Lladós and Saumik Bhattacharya and Umapada Pal},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
year={2023}}

Copyright: CC BY-NC-SA 4.0 © Subhajit Maity | Last updated: 22 Aug 2023 | Template Credit: Nerfies