The Complete Guide to Creating Machine-Readable Academic PDFs: Best Practices for Publishers and Editors
A comprehensive guide developed from real-world experience processing thousands of academic articles
Introduction
Over the past months, our team has been developing FullTextCreator, a specialized software solution that converts academic article PDFs into HTML and JATS XML formats for digital publishing and indexing. During this journey—processing hundreds of articles from various journals, disciplines, and countries—we’ve encountered a recurring challenge: inconsistent and incomplete metadata in PDF files.
PDF remains at the heart of academic publishing—and for good reason. Its ability to encapsulate an entire article in a single, self-contained file with embedded tables, figures, and formatting makes it unmatched for portability, sharing, and archival purposes. However, the publishing landscape is evolving. Increasingly, journals are complementing their PDF offerings with HTML, XML, JATS XML, and EPUB formats to enhance accessibility, enable seamless indexing, facilitate data exchange, and comply with emerging standards. These initiatives demonstrably improve a journal’s visibility and discoverability, streamline the indexing process, and meet the growing demands of major databases—many of which now require or strongly prefer structured formats alongside traditional PDFs.
Through our work with JATS XML generation, HTML conversion, and integration with indexing systems, we’ve identified critical patterns that determine whether an article can be processed smoothly or requires extensive manual intervention. The feedback from our users and the edge cases we’ve encountered have revealed that many publishers and content creators are unaware of the specific requirements that enable automated processing.
This guide represents our collective learning—a resource we feel compelled to share with the academic publishing community. Our goal is to help publishers, editors, typesetters, and content creators produce PDF files that are not just visually appealing, but also machine-readable, interoperable, and future-proof.
Whether you’re a journal editor, a typesetting professional, or an author preparing your manuscript, following these guidelines will ensure your content integrates seamlessly with the global academic infrastructure.
Why Does This Matter?
The Digital Academic Ecosystem
Academic publishing has evolved far beyond print journals. Today, your article’s discoverability, citation potential, and long-term preservation depend on its integration with numerous digital systems:
Discovery & Indexing:
- ✅ PubMed/PMC – The world’s largest biomedical literature database
- ✅ Scopus – Elsevier’s comprehensive abstract and citation database
- ✅ Web of Science – Clarivate’s premier citation index
- ✅ Google Scholar – The most widely used academic search engine
- ✅ DOAJ – Directory of Open Access Journals
- ✅ TR Dizin – Turkish national academic index
- ✅ EBSCO – Major academic database aggregator
Identifier & Registration Systems:
- ✅ Crossref – DOI registration and metadata repository
- ✅ ORCID – Researcher identification system
- ✅ ROR – Research Organization Registry
- ✅ Crossref Funder Registry (formerly FundRef) – Registry of funder identifiers used in funding acknowledgments
Preservation & Archiving:
- ✅ LOCKSS – Lots of Copies Keep Stuff Safe
- ✅ CLOCKSS – Controlled LOCKSS
- ✅ Portico – Digital preservation service
- ✅ Internet Archive – Long-term web archiving
Open Access & Compliance:
- ✅ OpenAIRE – European open science infrastructure
- ✅ CORE – World’s largest aggregator of open access research
- ✅ Unpaywall – Open access availability detection
- ✅ Plan S compliance – Funder mandate requirements
The Cost of Poor Metadata
When PDF metadata is incomplete or inconsistent:
- Delayed Indexing: Articles may take months to appear in databases, or never appear at all
- Lost Citations: Incorrectly formatted references can’t be matched and counted
- Author Misattribution: Authors lose credit for their work
- Funding Non-Compliance: Grant requirements may not be met
- Preservation Gaps: Articles may not qualify for long-term archiving
- Reduced Discoverability: Potential readers can’t find your content
- Manual Processing Costs: Staff time spent fixing avoidable errors
Essential Metadata Requirements
1. Digital Object Identifier (DOI)
The DOI is your article’s permanent digital address. It must be:
Correct Format:
DOI: 10.12345/journalname.2024.001
Common Mistakes to Avoid:
- Missing colon after “DOI”
- Using equals sign (DOI=10.xxx)
- Broken or unresolvable DOI links
- DOI placed only in footer (hard to extract)
Best Practice: Display the DOI prominently on the first page, preferably in the header area with the full https://doi.org/ URL.
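To illustrate why a consistent DOI line helps extraction, here is a minimal Python sketch of how a converter might pull a DOI out of first-page text. The pattern and the helper names (`extract_doi`, `doi_url`) are assumptions made for this example, not FullTextCreator's actual implementation; the pattern follows the common shape of modern DOIs ("10.", a numeric registrant prefix, a slash, a suffix).

```python
import re
from typing import Optional

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix
# that runs until whitespace or a quote/angle bracket.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI found in text as a bare identifier, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def doi_url(doi: str) -> str:
    """Build the resolvable https://doi.org/ form of a bare DOI."""
    return f"https://doi.org/{doi}"


if __name__ == "__main__":
    for line in ("DOI: 10.12345/journalname.2024.001",
                 "DOI: https://doi.org/10.12345/jes.2024.001"):
        print(extract_doi(line))
```

A pattern like this tolerates both the bare and the URL form, but it cannot recover a DOI that is missing from the page or split across lines, which is why placement and formatting matter.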
2. Author Information
Complete author metadata enables proper attribution and networking.
Required Elements:
Ahmet Yılmaz¹*, Mehmet Kaya², Ayşe Demir¹
¹ Istanbul University, Faculty of Medicine, Istanbul, Turkey
² Ankara University, Faculty of Science, Ankara, Turkey
* Corresponding Author: [email protected]
ORCID:
Ahmet Yılmaz: https://orcid.org/0000-0001-2345-6789
Mehmet Kaya: https://orcid.org/0000-0002-3456-7890
Ayşe Demir: https://orcid.org/0000-0003-4567-8901
OR
Ahmet Yılmaz: 0000-0001-2345-6789
Mehmet Kaya: 0000-0002-3456-7890
Ayşe Demir: 0000-0003-4567-8901
Critical Points:
- Author order must be clear and unambiguous
- Affiliations linked via superscript numbers
- Corresponding author marked with asterisk (*)
- Email address for corresponding author
- ORCID iDs for all authors (16 characters in four hyphen-separated groups, e.g. 0000-0000-0000-0000; the final character is a checksum and may be X)
Why ORCID Matters: ORCID disambiguation prevents author confusion (e.g., distinguishing between “J. Smith” researchers), ensures proper citation counting, and is increasingly required by funders and publishers.
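Because the last character of an ORCID iD is a checksum (ISO 7064 MOD 11-2), a typo can often be caught before publication. The sketch below shows one way to check it in Python; the function names are chosen for this example, and the sample iDs in the usage lines are ones ORCID itself publishes as examples.

```python
import re

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Accept a bare iD or the https://orcid.org/ URL form and verify the checksum."""
    candidate = value.strip().upper()
    # Drop the URL prefix if present (upper-cased by the line above).
    candidate = re.sub(r"^HTTPS?://ORCID\.ORG/", "", candidate)
    if not ORCID_PATTERN.match(candidate):
        return False
    digits = candidate.replace("-", "")
    return orcid_check_char(digits[:15]) == digits[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))
    print(is_valid_orcid("https://orcid.org/0000-0002-1694-233X"))
```

A passing checksum only shows the iD is well formed; it does not confirm that the iD belongs to the named author.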
3. Dates (Article Timeline)
Publication dates enable citation tracking and establish priority.
Required Dates:
Received: January 15, 2024
Revised: February 20, 2024 (if applicable)
Accepted: March 10, 2024
Published Online: April 1, 2024
Acceptable Formats:
- January 15, 2024 (Month DD, YYYY)
- 15 January 2024 (DD Month YYYY)
- 2024-01-15 (ISO format: YYYY-MM-DD)
- 15.01.2024 (DD.MM.YYYY – common in Europe)
Key Terms Recognized:
| English | Variations |
|---|---|
| Received | Submitted, Date received |
| Accepted | Date accepted, Approval date |
| Published | Published online, Online first, Publication date |
| Revised | Revision, Revised version |
Important: Be consistent—use the same date format throughout the document.
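As a rough illustration of how the date formats and label variations above can be normalized, the following Python sketch maps a labelled line to a canonical event name and an ISO date. The alias table mirrors the "Key Terms Recognized" table; the function names are assumptions for this example, and month-name parsing assumes an English locale.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# The four layouts listed above, tried in order.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d", "%d.%m.%Y")

# Label variations mapped to one canonical event name.
LABEL_ALIASES = {
    "received": "received", "submitted": "received", "date received": "received",
    "accepted": "accepted", "date accepted": "accepted", "approval date": "accepted",
    "published": "published", "published online": "published",
    "online first": "published", "publication date": "published",
    "revised": "revised", "revision": "revised", "revised version": "revised",
}


def parse_article_date(text: str) -> Optional[date]:
    """Try each accepted layout; return None if none of them fits."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def parse_date_line(line: str) -> Optional[Tuple[str, date]]:
    """Split 'Label: date' and normalize both parts."""
    label, sep, value = line.partition(":")
    event = LABEL_ALIASES.get(label.strip().lower())
    parsed = parse_article_date(value)
    if not sep or event is None or parsed is None:
        return None
    return event, parsed


if __name__ == "__main__":
    print(parse_date_line("Published Online: April 1, 2024"))
```

Note that a purely numeric date such as 01.02.2024 is only unambiguous if the document sticks to one convention, which is the practical reason for the consistency rule.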
4. Journal Information
Journal metadata connects your article to its publication venue.
Required Elements:
Journal of Health Sciences 2024; 15(3): 123-135
ISSN: 1234-5678 (Print) | e-ISSN: 8765-4321 (Online)
Publisher: Academic Publishing House
Components:
- Journal title (full name, not abbreviation only)
- Volume number
- Issue number
- Page range (first page – last page) OR e-location ID
- ISSN (print and/or electronic)
- Publisher name
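A citation line in the layout shown above ("Title Year; Volume(Issue): First-Last") can be split into its components with a single pattern. This Python sketch assumes exactly that layout; the pattern and the function name are illustrative, and articles that use an e-location ID instead of a page range would need a separate branch.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<journal title> <year>; <volume>(<issue>): <first>-<last>"
CITATION_PATTERN = re.compile(
    r"^(?P<journal>.+?)\s+(?P<year>\d{4});\s*"
    r"(?P<volume>\d+)\((?P<issue>[^)]+)\):\s*"
    r"(?P<fpage>\w+)\s*[-–]\s*(?P<lpage>\w+)$"
)


def parse_citation_line(line: str) -> Optional[Dict[str, str]]:
    """Return the citation components as strings, or None if the layout differs."""
    match = CITATION_PATTERN.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_citation_line("Journal of Health Sciences 2024; 15(3): 123-135"))
```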
ISSN Format: Always use the format XXXX-XXXX (four digits, a hyphen, then three digits followed by a check character that is either a digit or X).
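The final character of an ISSN is a check character computed from the first seven digits (weights 8 down to 2, modulo 11), so a mistyped ISSN can usually be detected automatically. The Python sketch below shows the calculation; the function names are chosen for this example. Placeholder ISSNs used in sample layouts are illustrative and are not expected to pass this check.

```python
import re

ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def issn_check_char(first_seven: str) -> str:
    """Weighted sum of the first seven digits (weights 8..2), modulo 11."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    result = (11 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_issn(issn: str) -> bool:
    """Check both the XXXX-XXXX shape and the check character."""
    candidate = issn.strip().upper()
    if not ISSN_PATTERN.match(candidate):
        return False
    digits = candidate.replace("-", "")
    return issn_check_char(digits[:7]) == digits[7]


if __name__ == "__main__":
    print(is_valid_issn("0317-8471"))
```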
5. Abstract and Keywords
Abstracts are crucial for indexing and discoverability.
Structure:
ABSTRACT
[Abstract text - typically 150-300 words]
Keywords: keyword1, keyword2, keyword3, keyword4, keyword5
Best Practices:
- Use “ABSTRACT” or “Abstract” as a clear heading
- For multilingual journals: provide abstracts in all relevant languages
- Keywords should be separated by commas or semicolons
- Include 3-7 keywords
- Use terms from controlled vocabularies when possible (MeSH for medical articles)
Multilingual Considerations: If your article is in a language other than English, include both:
- Abstract in the article’s language
- English abstract (required by most international indexes)
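The keyword rules above (comma or semicolon separators, three to seven entries) are simple to check mechanically. The following Python sketch shows one possible check; the label pattern and the function names are assumptions for this example.

```python
import re
from typing import List


def split_keywords(line: str) -> List[str]:
    """Strip a leading 'Keywords:' label and split on commas or semicolons."""
    body = re.sub(r"^\s*key\s*words?\s*:\s*", "", line, flags=re.IGNORECASE)
    parts = (part.strip().rstrip(".").strip() for part in re.split(r"[,;]", body))
    return [part for part in parts if part]


def keyword_count_ok(keywords: List[str], low: int = 3, high: int = 7) -> bool:
    """Apply the 3-7 keyword guideline."""
    return low <= len(keywords) <= high


if __name__ == "__main__":
    found = split_keywords("Keywords: keyword1, keyword2, keyword3, keyword4, keyword5")
    print(found, keyword_count_ok(found))
```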
6. Article Type Classification
Clearly indicate what type of article this is:
Common Types:
| Type | Description |
|---|---|
| Research Article | Original research with methodology and results |
| Review Article | Systematic or narrative review of literature |
| Case Report | Clinical or scientific case description |
| Editorial | Opinion piece by editors |
| Letter to Editor | Correspondence or brief communication |
| Short Communication | Brief research report |
| Meta-Analysis | Statistical analysis of multiple studies |
Display: Include article type prominently, typically above the title.
7. License and Copyright
Open access compliance requires clear licensing.
Recommended Format:
© 2024 The Author(s). This is an open access article under the
CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/)
Common Licenses:
- CC BY 4.0 – Attribution (most permissive)
- CC BY-NC 4.0 – Attribution-NonCommercial
- CC BY-SA 4.0 – Attribution-ShareAlike
- CC BY-NC-ND 4.0 – Attribution-NonCommercial-NoDerivatives
Important: Include the full URL to the license. This enables automated license detection.
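To show why the full URL matters, here is a small Python sketch of URL-based license detection. It only recognizes the Creative Commons licenses listed above, and the pattern and function name are assumptions for this example rather than any indexer's actual logic.

```python
import re
from typing import Optional

# Matches e.g. https://creativecommons.org/licenses/by-nc-nd/4.0/
CC_URL_PATTERN = re.compile(
    r"https?://creativecommons\.org/licenses/(by(?:-nc)?(?:-sa|-nd)?)/(\d\.\d)/?",
    re.IGNORECASE,
)


def detect_cc_license(text: str) -> Optional[str]:
    """Return a label such as 'CC BY-NC 4.0' if a license URL is present."""
    match = CC_URL_PATTERN.search(text)
    if match is None:
        return None
    return f"CC {match.group(1).upper()} {match.group(2)}"


if __name__ == "__main__":
    statement = ("© 2024 The Author(s). This is an open access article under the "
                 "CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/)")
    print(detect_cc_license(statement))
```

A statement that names the license without the URL ("CC BY 4.0" alone) returns None here, which is the failure mode the guideline is meant to prevent.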
Recommended PDF First Page Layout
Here’s an ideal structure for your article’s first page:
┌─────────────────────────────────────────────────────────────────┐
│ [JOURNAL LOGO] │
│ │
│ Journal of Example Sciences │
│ 2024; Volume 15, Issue 3, Pages 123-135 │
│ ISSN: 1234-5678 | e-ISSN: 8765-4321 │
│ DOI: https://doi.org/10.12345/jes.2024.001 │
│ │
│ ───────────────────────────────────────────────────────── │
│ │
│ RESEARCH ARTICLE │
│ │
│ Title of the Article Goes Here: A Comprehensive Study │
│ │
│ First Author¹*, Second Author², Third Author¹ │
│ │
│ ¹ Department of Science, University of Example, City, Country │
│ ² Institute of Research, Another University, City, Country │
│ │
│ * Corresponding Author: [email protected] │
│ │
│ ORCID: F. Author: 0000-0001-2345-6789 │
│ S. Author: 0000-0002-3456-7890 │
│ T. Author: 0000-0003-4567-8901 │
│ │
│ Received: January 15, 2024 | Accepted: March 10, 2024 │
│ Published Online: April 1, 2024 │
│ │
│ ───────────────────────────────────────────────────────── │
│ │
│ ABSTRACT │
│ │
│ [Abstract text...] │
│ │
│ Keywords: keyword1, keyword2, keyword3, keyword4 │
│ │
│ ───────────────────────────────────────────────────────── │
│ │
│ © 2024 The Author(s). CC BY 4.0 │
│ https://creativecommons.org/licenses/by/4.0/ │
│ │
└─────────────────────────────────────────────────────────────────┘
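For orientation, the sketch below shows how a few fields from a first page like this one might map onto JATS front matter, using Python's standard `xml.etree.ElementTree`. It is a partial illustration only: the `build_front` helper and the `SAMPLE_META` dictionary are assumptions for this example, the output has not been validated against a JATS DTD, and a complete record needs further elements (for example a journal identifier, contributors, publication date, and permissions).

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical values taken from the sample layout above.
SAMPLE_META = {
    "journal": "Journal of Example Sciences",
    "issn": "1234-5678",
    "eissn": "8765-4321",
    "doi": "10.12345/jes.2024.001",
    "title": "Title of the Article Goes Here: A Comprehensive Study",
    "volume": "15",
    "issue": "3",
    "fpage": "123",
    "lpage": "135",
}


def build_front(meta: Dict[str, str]) -> ET.Element:
    """Build a partial JATS <article><front> tree from extracted metadata."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")

    journal_meta = ET.SubElement(front, "journal-meta")
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = meta["journal"]
    ET.SubElement(journal_meta, "issn", {"pub-type": "ppub"}).text = meta["issn"]
    ET.SubElement(journal_meta, "issn", {"pub-type": "epub"}).text = meta["eissn"]

    article_meta = ET.SubElement(front, "article-meta")
    ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"}).text = meta["doi"]
    article_title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(article_title_group, "article-title").text = meta["title"]
    for key in ("volume", "issue", "fpage", "lpage"):
        ET.SubElement(article_meta, key).text = meta[key]
    return article


if __name__ == "__main__":
    print(ET.tostring(build_front(SAMPLE_META), encoding="unicode"))
```

Every value in the tree comes from a line on the first page, so a field that is missing or ambiguous in the PDF becomes a missing or wrong element in the XML.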
Article Content and Structural Layout
The structure of the body text matters as much as the first page for machine readability and conversion. Following the rules below when organizing your content helps our software and other indexing systems process your article correctly.
1. Heading Hierarchy and Format
Headings and subheadings within the article must clearly demonstrate the text’s hierarchical structure.
- Visual Distinction: Main headings (H1), subheadings (H2), and lower-level headings (H3) must differ from one another in size, weight (boldness), or style. For example, main headings could be 14 pt and bold, subheadings 12 pt and bold, and a lower level 12 pt and italic.
- Consistency: Apply your chosen format consistently throughout the entire article. This ensures software correctly detects heading levels.
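The following Python sketch illustrates, in simplified form, how heading levels can be inferred from typography alone, and why consistent styling matters. It assumes each line has already been extracted with its font size and bold/italic flags (the `Span` type is hypothetical), treats the style covering the most characters as body text, and ranks the remaining styles by size, then weight. It is not a description of how any particular converter works.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Span(NamedTuple):
    text: str
    size: float
    bold: bool
    italic: bool


def assign_heading_levels(spans: List[Span]) -> List[Tuple[str, Optional[int]]]:
    """Return (text, level) pairs; level is None for body text."""
    def style(s: Span) -> Tuple[float, bool, bool]:
        return (s.size, s.bold, s.italic)

    # Assume the style that covers the most characters is the body text.
    weight: Counter = Counter()
    for s in spans:
        weight[style(s)] += len(s.text)
    body = weight.most_common(1)[0][0]

    # Candidate heading styles: at least body size and visually distinct.
    candidates = [
        st for st in weight
        if st != body and st[0] >= body[0] and (st[0] > body[0] or st[1] or st[2])
    ]
    # Larger first; at equal size, bold before non-bold, upright before italic.
    candidates.sort(key=lambda st: (-st[0], not st[1], st[2]))
    level = {st: i + 1 for i, st in enumerate(candidates)}
    return [(s.text, level.get(style(s))) for s in spans]


if __name__ == "__main__":
    page = [
        Span("Introduction", 14, True, False),
        Span("Body text. " * 40, 11, False, False),
        Span("Methods overview", 12, True, False),
        Span("Sampling", 12, False, True),
    ]
    for text, lvl in assign_heading_levels(page):
        print(lvl, text[:30])
```

If the same heading level appears in two different styles, a ranking like this produces an extra, spurious level, which is the practical cost of inconsistent formatting.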
2. Image, Table, and Figure Captions
Visual elements and tables are separated from the text flow during machine processing. Each one therefore needs a caption that identifies it.
- Numbering: Use sequential numbering for each element type (e.g., Table 1, Table 2… or Figure 1, Figure 2…).
- Descriptive Text: A short and clear title/caption explaining the element must immediately follow the numbering.
- Position: Table titles should generally be placed above the table, while figure and image captions should generally be placed below the visual, and this rule must be consistent throughout the article.
- Example:
  - Table 1: Demographic characteristics of patients participating in the study.
  - Figure 3: Schematic representation of the experimental setup.
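Captions in the "Table N: text" and "Figure N: text" form shown above are easy to recognize with a pattern, and sequential numbering can then be checked automatically. This Python sketch is one possible approach; the pattern (which also accepts "Fig." and a period after the number) and the function names are assumptions for this example.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

CAPTION_PATTERN = re.compile(
    r"^(Table|Figure|Fig\.)\s+(\d+)\s*[:.]\s*(.+)$", re.IGNORECASE
)


def parse_caption(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (kind, number, caption text) or None if the line is not a caption."""
    match = CAPTION_PATTERN.match(line.strip())
    if match is None:
        return None
    kind = "figure" if match.group(1).lower().startswith("fig") else "table"
    return kind, int(match.group(2)), match.group(3).strip()


def missing_numbers(lines: Iterable[str]) -> Dict[str, List[int]]:
    """For each element type seen, list the numbers skipped in the sequence."""
    seen: Dict[str, set] = {"table": set(), "figure": set()}
    for line in lines:
        parsed = parse_caption(line)
        if parsed is not None:
            seen[parsed[0]].add(parsed[1])
    return {
        kind: sorted(set(range(1, max(nums) + 1)) - nums)
        for kind, nums in seen.items() if nums
    }


if __name__ == "__main__":
    print(parse_caption("Figure 3: Schematic representation of the experimental setup."))
```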
3. Column Layout
The layout of the article text directly affects the accuracy of the automated extraction process.
- Single Column Recommendation: Wherever possible, we strongly recommend using a single-column layout for the article body. Single-column texts are read much more easily and accurately by machines, the text flow remains intact, and HTML/XML conversions are smoother.
- Strict Necessity: If using two columns is absolutely necessary due to journal design, ensure that the gap between columns is distinct and the text flow (left-to-right, top-to-bottom) is clear. However, remember that multi-column structures increase the risk of conversion errors.
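To make the risk concrete, the Python sketch below orders text blocks on a page in two ways: naively from top to bottom, and with a simple two-column rule. The `Block` type, the coordinate convention (origin at top left, y increasing downward), and the midline heuristic are all assumptions for this example; real pages with full-width titles above two columns need more careful handling.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge, increasing downward
    text: str


def naive_order(blocks: List[Block]) -> List[str]:
    """Top-to-bottom, then left-to-right: interleaves columns on two-column pages."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Read the left column fully, then the right, if a clear gutter exists."""
    mid = page_width / 2
    if any(b.x0 < mid < b.x1 for b in blocks):
        # A block crosses the centre line: treat the page as single-column.
        return [b.text for b in sorted(blocks, key=lambda b: b.y0)]
    left = sorted((b for b in blocks if b.x1 <= mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x1 > mid), key=lambda b: b.y0)
    return [b.text for b in left + right]


if __name__ == "__main__":
    page = [
        Block(50, 280, 100, "L1"), Block(320, 550, 100, "R1"),
        Block(50, 280, 200, "L2"), Block(320, 550, 200, "R2"),
    ]
    print(naive_order(page))
    print(reading_order(page, page_width=600))
```

The heuristic only works when no text intrudes into the gutter, which is why a distinct gap between columns matters.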
Pre-Publication Checklist
Before finalizing your PDF, verify:
✅ Identification
- [ ] DOI is present and correctly formatted
- [ ] DOI link is functional (resolves to the article)
- [ ] ISSN/e-ISSN is displayed
- [ ] Journal name is complete (not just abbreviation)
✅ Authors
- [ ] All author names are listed in correct order
- [ ] Each author has affiliation number(s)
- [ ] All affiliations are listed with corresponding numbers
- [ ] Corresponding author is marked (*)
- [ ] Corresponding author email is provided
- [ ] ORCID iDs are included for all authors
✅ Dates
- [ ] Received date is present
- [ ] Accepted date is present
- [ ] Published/Online date is present
- [ ] Date format is consistent throughout
- [ ] Dates are logical (received < accepted < published)
✅ Content Metadata
- [ ] Article type is clearly indicated
- [ ] Abstract is present with clear heading
- [ ] Keywords are listed (comma or semicolon separated)
- [ ] Volume, issue, and page numbers are included
✅ Structural and Visual Layout
- [ ] Headings and subheadings are in different formats (size, bold, etc.) to show hierarchy
- [ ] All images, tables, and figures have numbered and descriptive captions
- [ ] Article body is prepared in a single-column layout if possible
✅ Rights & Access
- [ ] Copyright statement is included
- [ ] License type is specified
- [ ] License URL is provided
✅ Multilingual (if applicable)
- [ ] Abstract in article language
- [ ] Abstract in English
- [ ] Keywords in both languages
- [ ] All dates use consistent terminology
Common Mistakes and How to Fix Them
| Problem | Impact | Solution |
|---|---|---|
| DOI in footer only | Extraction failure | Move to header/first page body |
| Unclear author order | Attribution errors | Use numbered sequence |
| Missing ORCID | Author disambiguation fails | Add for all authors |
| Date format mixing | Parsing errors | Use one format consistently |
| No license URL | Open access detection fails | Include full CC URL |
| ISSN typo | Journal matching fails | Double-check format |
| Abstract without heading | Content extraction fails | Add clear “ABSTRACT” heading |
Conclusion
Creating well-structured PDFs is an investment in your journal’s digital future. The metadata you include today determines how discoverable, citable, and preservable your content will be for decades to come.
At FullTextCreator, we’ve built our system to handle variations in formatting—but the cleaner your source files, the more accurate and faster the conversion process. By following these guidelines, you’re not just making our job easier; you’re ensuring your authors’ work reaches its full potential audience.
Questions or need help? Visit fulltextcreator.com or contact us at [email protected].
This guide is provided by the FullTextCreator team as a service to the academic publishing community. We welcome feedback and suggestions for improvement.
Last Updated: January 2025

