Submitted:
27 May 2024
Posted:
29 May 2024
Abstract
Keywords:
1. Introduction
- C1: Text Extraction from PDF: PDF reports can combine multiple images, overlapping text elements, annotations, metadata, and unstructured text with no consistent layout. Extracting text from such reports is challenging and can lead to misspelled text and loss of topic-specific context. Attributes can also be missing or noisy: the text may not contain every attribute we are looking for. Visual attribute extraction therefore plays an important role.
- C2: Image Extraction from PDF: Images in PDF reports can be embedded in various formats such as JPEG and PNG, and are often compressed to reduce file size. Extracting them without losing resolution or quality requires specialized handling to preserve the original appearance. Images can also carry multi-labeled attributes that confuse the model; merging certain attribute values mitigates this and helps model inference.
- C3: Extracting Product Attributes: Product tags extracted from text and images need to be carefully mined to match product attributes. The attributes differ by product category and can be multi-labeled. For example, women’s tops have a sleeve-related attribute, whereas women’s trousers have a type-of-fit attribute and the sleeve attribute is irrelevant.
- C4: Mapping Product Attributes to Product Catalog: An e-commerce catalog has specific products with attributes mapped to them. Onboarding new attributes from PDF reports requires creating new attributes or refactoring existing ones.
- Novel Problem Formulation: We propose an end-to-end model that jointly extracts trending product attributes and hashtags from PDF files consisting of text and image data and maps them back to the product catalog to obtain the final product attribute values. An example of the end-to-end execution of product attribute extraction and mapping is shown in Figure 2. Due to Walmart privacy requirements, the models and datasets are not open to the public. We elaborate the details of each model, and readers can use an LLM of their own choice.
- Flexible Framework: We develop a general framework, PAE, for extracting text and images from PDF files and then generating product attributes. All components can be easily modified to enhance their capability, and the framework can also be used partially for other applications. The extraction engine can be used to extract attributes for different product categories such as electronics and home decor.
- Experiments: We performed extensive experiments on real-life datasets to demonstrate PAE’s efficacy. It successfully discovers attribute values from text and image data with a high F1-score of 96.8%, outperforming state-of-the-art models. This demonstrates its ability to produce stable and promising results.
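Challenge C3 notes that the relevant attributes depend on the product category. A minimal sketch of that filtering step follows. The category keys, the `RELEVANT_ATTRIBUTES` table, and the `filter_attributes` helper are hypothetical illustrations, not the schema or code used in PAE.

```python
# Hypothetical category-to-attribute schema (an assumption for illustration;
# the actual catalog schema is not public).
RELEVANT_ATTRIBUTES = {
    "womens_tops": {"color", "material", "sleeve_style", "neck"},
    "womens_trousers": {"color", "material", "fit_type"},
}


def filter_attributes(category: str, extracted: dict[str, str]) -> dict[str, str]:
    """Keep only the extracted attributes that are relevant for the category."""
    allowed = RELEVANT_ATTRIBUTES.get(category, set())
    return {name: value for name, value in extracted.items() if name in allowed}


# Attributes as they might come out of the text/image extraction step.
extracted = {"color": "red", "sleeve_style": "long", "fit_type": "slim"}

tops = filter_attributes("womens_tops", extracted)          # drops fit_type
trousers = filter_attributes("womens_trousers", extracted)  # drops sleeve_style
```

Keeping the schema as plain data means a new category can be onboarded by adding one entry rather than changing the extraction logic.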
2. Problem Definition

3. Product Attribute Extraction
3.1. Text Extraction from PDF

3.2. Image Extraction from PDF
3.3. Attribute Extraction from Text


3.4. Attribute Extraction from Images



3.5. Hashtag Detection in Text

3.6. Product Attribute Matching
4. Experiments
- (Q1) How accurate is our proposed method PAE when compared to other baselines?
- (Q2) How sensitive is PAE to different parameters?
- (Q3) How time-consuming is PAE?
4.1. Dataset Description
4.2. Evaluation Measures
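The result tables report precision, true positive rate, accuracy, and F1-score. Assuming these follow the standard confusion-matrix definitions, they can be sketched as below. The function name `evaluation_measures` and the counts in the usage line are illustrative assumptions, not values from the experiments.

```python
def evaluation_measures(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute standard measures from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives, and true negatives. Empty denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # True positive rate is the same quantity as recall.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and true positive rate.
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {
        "precision": precision,
        "true_positive_rate": tpr,
        "accuracy": accuracy,
        "f1_score": f1,
    }


# Illustrative counts only.
scores = evaluation_measures(tp=8, fp=2, fn=2, tn=8)
```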
4.3. Baselines
4.4. Accuracy of PAE
4.5. Parameter Sensitivity of PAE
4.5.1. Sensitivity to LLM Prompt for Text Data
- Prompt 1: "Give me all clothing characteristics of a product from the following text:"
- Prompt 2: "Give me color, sleeve style, product type, material, cloth features, categories, and neck attributes from the following text:"
- Prompt 3: "I want you to act as a product attribute extractor in retail space. Given the unstructured text data, you need to find different product attributes in the text. For example: For Input as ‘Long contrast fabric Sleeve red cotton adult polo shirts for men with contemporary design element’, the attribute extractor will return color attribute is red, sleeve attribute is Long, style sleeve attribute is contrast fabric, product type attribute is polo shirts, material attribute is cotton, feature attribute is contemporary, categories is polo shirts, gender attribute is men and neck attribute is NA. Give me attributes like color, sleeve style, product type, material, features, categories, and neck attributes from the following text: "
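Each prompt above ends with "from the following text:", so the extracted PDF text is presumably appended to the instruction before it is sent to the LLM. A minimal sketch of that assembly step is shown below. The `PROMPT_TEMPLATES` mapping, the `build_prompt` helper, and the newline separator are assumptions for illustration; the call to the LLM is omitted because the model is left to the reader's choice.

```python
# Prompts 1 and 2 copied from the list above; Prompt 3 could be added
# to the mapping in the same way.
PROMPT_TEMPLATES = {
    "prompt_1": (
        "Give me all clothing characteristics of a product "
        "from the following text:"
    ),
    "prompt_2": (
        "Give me color, sleeve style, product type, material, cloth features, "
        "categories, and neck attributes from the following text:"
    ),
}


def build_prompt(template_key: str, report_text: str) -> str:
    """Append the extracted report text to the chosen instruction.

    Joining with a single newline is an assumption, not the paper's format.
    """
    instruction = PROMPT_TEMPLATES[template_key].strip()
    return f"{instruction}\n{report_text.strip()}"


# Sample input taken from the worked example inside Prompt 3.
sample_text = (
    "Long contrast fabric Sleeve red cotton adult polo shirts "
    "for men with contemporary design element"
)
prompt = build_prompt("prompt_1", sample_text)
```

Keeping the instruction separate from the report text makes it straightforward to rerun the same input under each prompt variant when measuring prompt sensitivity.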
4.5.2. Sensitivity to LLM Parameters
4.6. CPU Time Analysis


5. Conclusions and Future Work
- The proposed framework effectively identifies product attributes from PDF files to support an assortment planning task. To further enhance its capability, we designed the framework for flexibility, so that the extraction of data and attributes can be easily enhanced and modified for domain-specific applications.
- Through experimental evaluation on multiple datasets, we show that PAE provides accurate attributes and is significantly faster in terms of CPU run time.
Acknowledgments
References
- Logan IV, R.L.; Humeau, S.; Singh, S. Multimodal attribute extraction. arXiv preprint arXiv:1711.11118 2017.
- De la Comble, A.; Dutt, A.; Montalvo, P.; Salah, A. Multi-modal attribute extraction for e-commerce. arXiv preprint arXiv:2203.03441 2022.
- Zhu, T.; Wang, Y.; Li, H.; Wu, Y.; He, X.; Zhou, B. Multimodal joint attribute prediction and value extraction for e-commerce product. arXiv preprint arXiv:2009.07162 2020.
- Ghosh, P.; Wang, N.; Yenigalla, P. D-Extract: Extracting dimensional attributes from product images. WACV 2023, 2023.
- Zheng, G.; Mukherjee, S.; Dong, X.L.; Li, F. Opentag: Open attribute value extraction from product profiles. Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1049–1058.
- Bougouin, A.; Boudin, F.; Daille, B. Topicrank: Graph-based topic ranking for keyphrase extraction. International joint conference on natural language processing (IJCNLP), 2013, pp. 543–551.
- Xu, H.; Wang, W.; Mao, X.; Jiang, X.; Lan, M. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5214–5223.
- Shinyama, P. pdfminer. https://www.unixuser.org/~euske/python/pdfminer/, 2004.
- Cushman, J. PDFQuery. https://github.com/jcushman/pdfquery/tree/master, 2013.
- Belval, E. pdf2image. https://pypi.org/project/pdf2image/, 2017.
- Google. Google Cloud Vision API. https://cloud.google.com/python/docs/reference/vision/latest.
- claird. PyPDF4. https://pypi.org/project/PyPDF4/, 2018.
- Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. International conference on machine learning. PMLR, 2021, pp. 5583–5594.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International conference on machine learning. PMLR, 2022, pp. 12888–12900.






| Dataset | |||||
| Boys Apparel | 7 | 11 | 32 | 0 | Y |
| Women’s Cut Sew | 7 | 30 | 24 | 5 | Y |
| Women’s Woven Tops | 6 | 28 | 24 | 6 | Y |
| Country Life Boys | 12 | 12 | 66 | 3 | Y |
| Knitwear Jersey | 10 | 13 | 54 | 8 | Y |
| Modern Occasion | 10 | 16 | 60 | 10 | Y |
| Jackets Outerwear | 7 | 9 | 9 | 7 | Y |
| Woven Tops | 7 | 10 | 9 | 6 | Y |
| Knitwear Core | 6 | 11 | 31 | 4 | Y |
| Knitwear Fashion | 12 | 21 | 70 | 7 | Y |
| Woven Tops Core | 6 | 13 | 35 | 0 | Y |
| Woven Tops Fashion | 13 | 24 | 74 | 4 | Y |
| Dataset | F1-score (Text) | F1-score (Image) |
| Boys Apparel | ||
| Women’s Cut Sew | ||
| Women’s Woven Tops | ||
| Country Life Boys | ||
| Knitwear Jersey | ||
| Modern Occasion | ||
| Jackets Outerwear | ||
| Woven Tops | ||
| Knitwear Core | ||
| Knitwear Fashion | ||
| Woven Tops Core | ||
| Woven Tops Fashion |
| Dataset | PAE | TopicRank [6] | sOpenTag [7] |
| Precision | |||
| True Positive Rate | |||
| Accuracy | |||
| F1-Score |
| Attributes | PAE | ViLT [13] | BLIP [14] |
| Color | |||
| Sleeve Style | |||
| Product Type | |||
| Material | |||
| Features | |||
| Categories | |||
| Age Group | |||
| Neck |
| Dataset | Prompt 1 | Prompt 2 | Prompt 3 |
| Precision | |||
| True Positive Rate | |||
| Accuracy | |||
| F1-Score |
| Dataset | ||||
| Precision | ||||
| True Positive Rate | ||||
| Accuracy | ||||
| F1-Score |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).