The Complete Guide to Creating Machine-Readable Academic PDFs: Best Practices for Publishers and Editors

A comprehensive guide developed from real-world experience processing thousands of academic articles


Introduction

Over the past months, our team has been developing FullTextCreator, a specialized software solution that converts academic article PDFs into HTML and JATS XML formats for digital publishing and indexing. During this journey—processing hundreds of articles from various journals, disciplines, and countries—we’ve encountered a recurring challenge: inconsistent and incomplete metadata in PDF files.

PDF remains at the heart of academic publishing—and for good reason. Its ability to encapsulate an entire article in a single, self-contained file with embedded tables, figures, and formatting makes it unmatched for portability, sharing, and archival purposes. However, the publishing landscape is evolving. Increasingly, journals are complementing their PDF offerings with HTML, XML, JATS XML, and EPUB formats to enhance accessibility, enable seamless indexing, facilitate data exchange, and comply with emerging standards. These initiatives demonstrably improve a journal’s visibility and discoverability, streamline the indexing process, and meet the growing demands of major databases—many of which now require or strongly prefer structured formats alongside traditional PDFs.

Through our work with JATS XML generation, HTML conversion, and integration with indexing systems, we’ve identified critical patterns that determine whether an article can be processed smoothly or requires extensive manual intervention. The feedback from our users and the edge cases we’ve encountered have revealed that many publishers and content creators are unaware of the specific requirements that enable automated processing.

This guide represents our collective learning—a resource we feel compelled to share with the academic publishing community. Our goal is to help publishers, editors, typesetters, and content creators produce PDF files that are not just visually appealing, but also machine-readable, interoperable, and future-proof.

Whether you’re a journal editor, a typesetting professional, or an author preparing your manuscript, following these guidelines will ensure your content integrates seamlessly with the global academic infrastructure.


Why Does This Matter?

The Digital Academic Ecosystem

Academic publishing has evolved far beyond print journals. Today, your article’s discoverability, citation potential, and long-term preservation depend on its integration with numerous digital systems:

Discovery & Indexing:

  • PubMed/PMC – The world’s largest biomedical literature database
  • Scopus – Elsevier’s comprehensive abstract and citation database
  • Web of Science – Clarivate’s premier citation index
  • Google Scholar – The most widely used academic search engine
  • DOAJ – Directory of Open Access Journals
  • TR Dizin – Turkish national academic index
  • EBSCO – Major academic database aggregator

Identifier & Registration Systems:

  • Crossref – DOI registration and metadata repository
  • ORCID – Researcher identification system
  • ROR – Research Organization Registry
  • FundRef (now the Crossref Funder Registry) – Registry of funder identifiers

Preservation & Archiving:

  • LOCKSS – Lots of Copies Keep Stuff Safe
  • CLOCKSS – Controlled LOCKSS
  • Portico – Digital preservation service
  • Internet Archive – Long-term web archiving

Open Access & Compliance:

  • OpenAIRE – European open science infrastructure
  • CORE – World’s largest aggregator of open access research
  • Unpaywall – Open access availability detection
  • Plan S compliance – Funder mandate requirements

The Cost of Poor Metadata

When PDF metadata is incomplete or inconsistent:

  1. Delayed Indexing: Articles may take months to appear in databases, or never appear at all
  2. Lost Citations: Incorrectly formatted references can’t be matched and counted
  3. Author Misattribution: Authors lose credit for their work
  4. Funding Non-Compliance: Grant requirements may not be met
  5. Preservation Gaps: Articles may not qualify for long-term archiving
  6. Reduced Discoverability: Potential readers can’t find your content
  7. Manual Processing Costs: Staff time spent fixing avoidable errors

Essential Metadata Requirements

1. Digital Object Identifier (DOI)

The DOI is your article’s permanent digital address. It must appear in a consistent, machine-readable form.

Correct Format:

DOI: 10.12345/journalname.2024.001

Common Mistakes to Avoid:

  • Missing colon after “DOI”
  • Using equals sign (DOI=10.xxx)
  • Broken or unresolvable DOI links
  • DOI placed only in footer (hard to extract)
  • DOI split across two lines (last digits on the next line cause extraction errors)

Best Practice: Display the DOI prominently on the first page, preferably in the header area with the full https://doi.org/ URL. The DOI must appear as a single, unbroken string on one line.
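
To illustrate why a line break inside the DOI matters, here is a minimal sketch of pattern-based DOI extraction. The regular expression is a common heuristic rather than an official grammar, and the function name `extract_doi` is hypothetical; it is not the method used by any particular indexer.

```python
import re
from typing import Optional

# Heuristic pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix
# that runs until the next whitespace. This is an assumption, not a standard.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(first_page_text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None (hypothetical helper)."""
    match = DOI_RE.search(first_page_text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


intact = "DOI: https://doi.org/10.12345/journalname.2024.001"
broken = "DOI: https://doi.org/10.12345/journalname.2024.0\n01"

print(extract_doi(intact))  # 10.12345/journalname.2024.001
print(extract_doi(broken))  # 10.12345/journalname.2024.0  (last digits lost)
```

Because the pattern stops at whitespace, the digits pushed onto the next line are silently dropped and the truncated DOI no longer resolves.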


2. Journal Information

Journal metadata connects your article to its publication venue. This is one of the most commonly missing elements in academic PDFs we process—many articles lack basic journal identification on the first page.

Required Elements:

Journal of Health Sciences 2024; 15(3): 123-135
ISSN: 1234-5678 (Print) | e-ISSN: 8765-4321 (Online)
Publisher: Academic Publishing House

Components:

  • Journal title (full name, not abbreviation only) — must appear on every article’s first page
  • Volume number
  • Issue number
  • Page range (first page – last page) OR e-location ID
  • ISSN (print and/or electronic), written in XXXX-XXXX format (the final character may be the letter X)
  • Publisher name

Critical: Without journal name and ISSN on the first page, automated systems cannot identify which journal the article belongs to. This is a fundamental requirement that many journals overlook.
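
An ISSN typo can often be caught before publication, because the last character of an ISSN is a check digit. The sketch below validates the XXXX-XXXX shape and the standard modulus-11 check digit; the function name `is_valid_issn` is hypothetical.

```python
import re

# Four digits, a hyphen, three digits, then a check character (digit or X).
ISSN_RE = re.compile(r"^(\d{4})-(\d{3})([\dX])$")


def is_valid_issn(issn: str) -> bool:
    """Check the XXXX-XXXX shape and the modulus-11 check digit (hypothetical helper)."""
    match = ISSN_RE.match(issn.strip().upper())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run from 8 down to 2 over the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3) == expected


print(is_valid_issn("0378-5955"))  # True
print(is_valid_issn("0378-5954"))  # False: check digit does not match
print(is_valid_issn("03785955"))   # False: hyphen is missing
```

Placeholder values such as those in this guide's examples are illustrative and are not expected to pass the check.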


3. Author Information

Complete author metadata enables proper attribution and networking.

Required Elements:

Ahmet Yilmaz¹*, Mehmet Kaya², Ayse Demir¹

¹ Istanbul University, Faculty of Medicine, Istanbul, Turkey
² Ankara University, Faculty of Science, Ankara, Turkey

* Corresponding Author: [email protected]

ORCID:
Ahmet Yilmaz: https://orcid.org/0000-0001-2345-6789
Mehmet Kaya: https://orcid.org/0000-0002-3456-7890
Ayse Demir: https://orcid.org/0000-0003-4567-8901

Critical Points:

  • Author order must be clear and unambiguous
  • Affiliations linked via superscript numbers — use standard superscripts (¹ ² ³), not curly braces {1}, not square brackets [1], not parentheses (1). Superscript numbers must be directly adjacent to the author name with no space between them
  • Corresponding author marked with asterisk (*) — additionally, include a clear line: “Corresponding Author: Name, email” or equivalent
  • Email address for corresponding author — must be a complete, valid email
  • ORCID iDs for all authors, provided as a full URL or plain ID (16-character format: 0000-0000-0000-0000; the final character is a check character and may be a digit or X)

Corresponding Author Labels: Use one of these recognized terms:

  • Corresponding Author / Correspondence Author
  • Author for Correspondence / Contact Author
  • Primary Contact / Lead Contact

Why ORCID Matters: ORCID disambiguation prevents author confusion (e.g., distinguishing between “J. Smith” researchers), ensures proper citation counting, and is increasingly required by funders and publishers.
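
ORCID iDs also carry a check character (ISO 7064 MOD 11-2), so a mistyped iD can usually be detected automatically. The following is a minimal sketch that accepts either the full URL or the plain ID; `normalize_orcid` is a hypothetical name.

```python
import re
from typing import Optional

# Four groups of four characters; only the final character may be X.
ORCID_RE = re.compile(r"(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])")


def normalize_orcid(value: str) -> Optional[str]:
    """Return the bare iD if its check character is correct, else None (hypothetical helper)."""
    match = ORCID_RE.search(value.upper())
    if match is None:
        return None
    chars = "".join(match.groups())
    total = 0
    for digit in chars[:15]:
        # ISO 7064 MOD 11-2: add the digit, then double the running total.
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    if chars[15] != expected:
        return None
    return "-".join(match.groups())


print(normalize_orcid("https://orcid.org/0000-0002-1825-0097"))  # 0000-0002-1825-0097
print(normalize_orcid("0000-0002-1825-0098"))                    # None
```

The check only confirms that the iD is well formed. It does not confirm that the iD belongs to the named author.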


4. Affiliation Format and Placement

Affiliations identify the institutions where the research was conducted. Their correct formatting is critical for machine processing.

Recommended Format:

¹ Istanbul University, Faculty of Medicine, Department of Cardiology, Istanbul, Turkey
² Ankara University, Faculty of Science, Department of Biology, Ankara, Turkey

Critical Rules:

  • Place affiliations directly after the author names — not in the page footer. Footer-placed affiliations are frequently misread, fragmented, or lost during automated extraction
  • Each affiliation must be on its own line with a superscript number matching the author
  • Follow a consistent order: Institution, Faculty/School, Department, City, Country
  • Include ROR ID (Research Organization Registry) when available — e.g., ROR: https://ror.org/02bf6br77
  • Do not combine multiple affiliations on a single line — each institution gets its own numbered entry

Common Mistake: Placing affiliations as page footnotes (below a horizontal rule at the bottom of page 1). While visually acceptable, this causes extraction systems to read the affiliation text out of order, mix it with body text, or skip it entirely.
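
The sketch below shows how the recommended author and affiliation format can be linked by a simple parser. It assumes the exact conventions described above (Unicode superscripts directly after the name, an asterisk for the corresponding author, one affiliation per line); the function names are hypothetical, and real extraction tools are considerably more involved.

```python
import re
from typing import Dict, List

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
TO_ASCII = str.maketrans(SUPERSCRIPTS, "0123456789")


def parse_author_line(line: str) -> List[dict]:
    """Split an author line into names, affiliation numbers, and a corresponding flag."""
    authors = []
    # Authors are separated by a comma followed by whitespace. A bare comma
    # between superscripts (e.g. "¹,²") stays attached to its author.
    for part in re.split(r",\s+", line.strip()):
        name = part.rstrip(SUPERSCRIPTS + ",*")
        marks = part[len(name):]
        numbers = [int(run.translate(TO_ASCII))
                   for run in re.findall(f"[{SUPERSCRIPTS}]+", marks)]
        authors.append({
            "name": name.strip(),
            "affiliations": numbers,
            "corresponding": "*" in marks,
        })
    return authors


def parse_affiliations(lines: List[str]) -> Dict[int, str]:
    """Map each leading superscript number to its affiliation text."""
    affiliations = {}
    for line in lines:
        match = re.match(rf"^([{SUPERSCRIPTS}]+)\s*(.+)$", line.strip())
        if match:
            affiliations[int(match.group(1).translate(TO_ASCII))] = match.group(2)
    return affiliations


authors = parse_author_line("Ahmet Yilmaz¹*, Mehmet Kaya², Ayse Demir¹")
affiliations = parse_affiliations([
    "¹ Istanbul University, Faculty of Medicine, Istanbul, Turkey",
    "² Ankara University, Faculty of Science, Ankara, Turkey",
])
for author in authors:
    print(author["name"], [affiliations[n] for n in author["affiliations"]])
```

When affiliations sit in the page footer instead, the lines passed to a parser like this are typically interleaved with body text, and the numbered lookup fails.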


5. Dates (Article Timeline)

Publication dates enable citation tracking and establish priority.

Required Dates:

Received: January 15, 2024
Revised: February 20, 2024 (if applicable)
Accepted: March 10, 2024
Published Online: April 1, 2024

Acceptable Formats:

  • January 15, 2024 (Month DD, YYYY)
  • 15 January 2024 (DD Month YYYY)
  • 2024-01-15 (ISO format: YYYY-MM-DD)
  • 15.01.2024 (DD.MM.YYYY – common in Europe)

Key Terms Recognized:

English | Variations
Received | Submitted, Date received
Accepted | Date accepted, Approval date
Published | Published online, Online first, Publication date
Revised | Revision, Revised version

Important: Be consistent—use the same date format throughout the document.
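
As a rough illustration of how the four acceptable formats can be normalized and checked for logical order, consider the sketch below. It assumes English month names under Python's default locale, and the helper names `parse_article_date` and `is_chronological` are hypothetical.

```python
from datetime import date, datetime
from typing import Dict

# The four formats listed above: Month DD, YYYY; DD Month YYYY; ISO; DD.MM.YYYY.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d", "%d.%m.%Y")


def parse_article_date(text: str) -> date:
    """Try each accepted format in turn and return a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def is_chronological(history: Dict[str, str]) -> bool:
    """True when the dates present follow received, revised, accepted, published order."""
    order = ["received", "revised", "accepted", "published"]
    dates = [parse_article_date(history[key]) for key in order if key in history]
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))


history = {
    "received": "January 15, 2024",
    "revised": "February 20, 2024",
    "accepted": "March 10, 2024",
    "published": "April 1, 2024",
}
print(parse_article_date("15.01.2024"))  # 2024-01-15
print(is_chronological(history))         # True
```

A purely numeric date such as 01.02.2024 is ambiguous between day-first and month-first conventions, which is one reason a single consistent format matters.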


6. Abstract and Keywords

Abstracts are crucial for indexing and discoverability.

Structure:

ABSTRACT

[Abstract text - typically 150-300 words]

Keywords: keyword1, keyword2, keyword3, keyword4, keyword5

Best Practices:

  • Use “ABSTRACT” or “Abstract” as a clear, distinct heading — not embedded inline with body text
  • The abstract must be clearly separated from the introduction section
  • Keywords should be separated by commas or semicolons — use one separator consistently, not mixed
  • Include 3-7 keywords
  • Use terms from controlled vocabularies when possible (MeSH for medical articles)
  • For multilingual journals: provide abstracts in all relevant languages
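
The separator rule above can be checked mechanically. This is a minimal sketch, assuming the keyword line starts with a label followed by a colon; `parse_keywords` is a hypothetical name.

```python
from typing import List


def parse_keywords(line: str) -> List[str]:
    """Split a 'Keywords: a, b, c' line and reject mixed separators."""
    label, colon, body = line.partition(":")
    if not colon:
        body = line  # no label present; treat the whole line as keywords
    has_comma = "," in body
    has_semicolon = ";" in body
    if has_comma and has_semicolon:
        raise ValueError("Mixed separators: use commas or semicolons, not both")
    separator = ";" if has_semicolon else ","
    return [keyword.strip() for keyword in body.split(separator) if keyword.strip()]


print(parse_keywords("Keywords: keyword1, keyword2, keyword3"))
print(parse_keywords("Keywords: keyword1; keyword2; keyword3"))
```

With mixed separators a parser has to guess whether a comma ends a keyword or belongs inside one, so this sketch refuses to guess and raises an error instead.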

Multilingual Considerations: If your article is in a language other than English, include both:

  • Abstract in the article’s language
  • English abstract (required by most international indexes)

7. Article Type Classification

Clearly indicate what type of article this is:

Common Types:

Type | Description
Research Article | Original research with methodology and results
Review Article | Systematic or narrative review of literature
Case Report | Clinical or scientific case description
Editorial | Opinion piece by editors
Letter to Editor | Correspondence or brief communication
Short Communication | Brief research report
Meta-Analysis | Statistical analysis of multiple studies

Display: Include article type prominently, typically above the title.


8. License and Copyright

Open access compliance requires clear licensing.

Recommended Format:

© 2024 The Author(s). This is an open access article under the
CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/)

Common Licenses:

  • CC BY 4.0 – Attribution (most permissive)
  • CC BY-NC 4.0 – Attribution-NonCommercial
  • CC BY-SA 4.0 – Attribution-ShareAlike
  • CC BY-NC-ND 4.0 – Attribution-NonCommercial-NoDerivatives

Critical Requirements:

  • License type must appear as text — using only a Creative Commons icon/badge image is not sufficient for automated detection. The text “CC BY 4.0” (or whichever license applies) must be explicitly written
  • Include the full URL to the license (e.g., https://creativecommons.org/licenses/by/4.0/)
  • Place the license information on the first page — license text buried at the end of the article or on the last page is frequently missed by extraction systems
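
To show why the license must appear as text, here is a small sketch of text-based license detection. The pattern and the function name `detect_cc_license` are assumptions for illustration; the URL is built from the matched text and is not checked against the Creative Commons site.

```python
import re
from typing import Optional

# Matches "CC BY 4.0", "CC BY-NC 4.0", "CC BY-NC-ND 4.0", and similar.
LICENSE_RE = re.compile(r"\bCC[\s-]+BY((?:-(?:NC|SA|ND))*)\s+(\d\.\d)\b", re.IGNORECASE)


def detect_cc_license(text: str) -> Optional[str]:
    """Return a Creative Commons license URL derived from the text, or None."""
    match = LICENSE_RE.search(text)
    if match is None:
        return None
    suffix = match.group(1).lower()  # e.g. "-nc-nd", or "" for plain CC BY
    version = match.group(2)
    return f"https://creativecommons.org/licenses/by{suffix}/{version}/"


notice = "© 2024 The Author(s). This is an open access article under the CC BY 4.0 license"
print(detect_cc_license(notice))                   # https://creativecommons.org/licenses/by/4.0/
print(detect_cc_license("[license badge image]"))  # None
```

A badge image contributes no text to the page, so a detector of this kind has nothing to match.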

First Page Layout: The Golden Rule

The first page of an academic article must contain all article metadata and nothing else. No article body text (Introduction, Methods, etc.) should begin on the first page. The first page is exclusively reserved for:

  • Journal name, ISSN, volume, issue, pages
  • DOI
  • Article type
  • Title
  • Authors with affiliations
  • Corresponding author information
  • ORCID identifiers
  • Article history dates
  • Abstract and keywords
  • License and copyright notice

Why this matters: When the Introduction or body text begins on the first page alongside metadata, automated extraction systems struggle to distinguish where metadata ends and content begins. This leads to body text being misidentified as author names, affiliations, or abstract content — producing corrupted metadata that requires manual correction.

Recommended First Page Layout:

+---------------------------------------------------------------+
|  [JOURNAL LOGO]                                               |
|                                                               |
|  Journal of Example Sciences                                  |
|  2024; Volume 15, Issue 3, Pages 123-135                      |
|  ISSN: 1234-5678 | e-ISSN: 8765-4321                          |
|  DOI: https://doi.org/10.12345/jes.2024.001                   |
|                                                               |
|  -----------------------------------------------------------  |
|                                                               |
|  RESEARCH ARTICLE                                             |
|                                                               |
|  Title of the Article Goes Here: A Comprehensive Study        |
|                                                               |
|  First Author¹*, Second Author², Third Author¹                |
|                                                               |
|  ¹ Department of Science, University of Example, City, Country|
|  ² Institute of Research, Another University, City, Country   |
|                                                               |
|  * Corresponding Author: [email protected]          |
|                                                               |
|  ORCID: F. Author: 0000-0001-2345-6789                        |
|         S. Author: 0000-0002-3456-7890                        |
|         T. Author: 0000-0003-4567-8901                        |
|                                                               |
|  Received: January 15, 2024 | Accepted: March 10, 2024        |
|  Published Online: April 1, 2024                              |
|                                                               |
|  -----------------------------------------------------------  |
|                                                               |
|  ABSTRACT                                                     |
|                                                               |
|  [Abstract text...]                                           |
|                                                               |
|  Keywords: keyword1, keyword2, keyword3, keyword4             |
|                                                               |
|  -----------------------------------------------------------  |
|                                                               |
|  (c) 2024 The Author(s). CC BY 4.0                            |
|  https://creativecommons.org/licenses/by/4.0/                 |
|                                                               |
+---------------------------------------------------------------+

Article Content and Structural Layout

The structure of the body text matters as much as the first page for machine readability and conversion. Following the rules below helps our software and other indexing systems process your article correctly.

1. Heading Hierarchy and Format

Headings and subheadings within the article must clearly demonstrate the text’s hierarchical structure.

  • Visual Distinction: Main headings (H1), subheadings (H2), and lower-level headings (H3) must differ from one another in size, weight (boldness), or style. For example, main headings could be 14 pt bold, subheadings 12 pt bold, and lower-level headings 12 pt italic.
  • Consistency: Apply your chosen format consistently throughout the entire article. This ensures software correctly detects heading levels.
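
The sketch below illustrates one way software might infer heading levels from formatting alone. It is a simplified assumption, not a description of any specific tool: distinct styles are ranked by size, then bold, then upright over italic. The `Style` tuple and `assign_heading_levels` are hypothetical names.

```python
from typing import List, Tuple

Style = Tuple[float, bool, bool]  # (font size in pt, bold, italic)


def assign_heading_levels(headings: List[Tuple[str, Style]]) -> List[Tuple[str, int]]:
    """Rank distinct styles (larger, bolder, upright first) and map each heading to a level."""
    styles = sorted(
        {style for _, style in headings},
        key=lambda s: (-s[0], not s[1], s[2]),
    )
    level_of = {style: index + 1 for index, style in enumerate(styles)}
    return [(text, level_of[style]) for text, style in headings]


headings = [
    ("Methods", (14.0, True, False)),
    ("Participants", (12.0, True, False)),
    ("Inclusion criteria", (12.0, False, True)),
    ("Results", (14.0, True, False)),
]
print(assign_heading_levels(headings))
# [('Methods', 1), ('Participants', 2), ('Inclusion criteria', 3), ('Results', 1)]
```

If one main heading is accidentally set at 13 pt, an approach like this creates a fourth style and shifts every level below it, which is why consistency matters.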

2. Image, Table, and Figure Handling

Tables and visual elements are separated from the text flow during machine processing, so each one must have a caption that identifies it.

  • Numbering: Use sequential numbering for each element type (e.g., Table 1, Table 2… or Figure 1, Figure 2…).
  • Descriptive Text: A short and clear title/caption explaining the element must immediately follow the numbering.
  • Position: Place table titles above the table and figure or image captions below the visual, and apply the same placement throughout the article.
  • Example:
    • Table 1: Demographic characteristics of patients participating in the study.
    • Figure 3: Schematic representation of the experimental setup.
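
Caption numbering can be verified with a short script before the PDF is finalized. This is a minimal sketch, assuming captions start a line with "Table N:" or "Figure N:" (a period after the number is also accepted); `find_caption_problems` is a hypothetical name.

```python
import re
from typing import List

# A caption line: element type, number, a colon or period, then descriptive text.
CAPTION_RE = re.compile(r"^(Table|Figure)\s+(\d+)[:.]\s+(\S.*)$")


def find_caption_problems(lines: List[str]) -> List[str]:
    """Report captions whose number breaks the sequence for its element type."""
    expected = {"Table": 1, "Figure": 1}
    problems = []
    for line in lines:
        match = CAPTION_RE.match(line.strip())
        if match is None:
            continue
        kind, number = match.group(1), int(match.group(2))
        if number != expected[kind]:
            problems.append(f"{kind} {number} found where {kind} {expected[kind]} was expected")
        expected[kind] = number + 1
    return problems


print(find_caption_problems([
    "Figure 1: Study flow diagram.",
    "Figure 3: Schematic representation of the experimental setup.",
]))
# ['Figure 3 found where Figure 2 was expected']
```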

Figure Color Space — Critical for Extraction

  • All figures must be embedded in RGB color space — figures saved in CMYK (print-oriented) or Indexed color formats cannot be extracted by many OCR and conversion tools, including commonly used engines like Marker
  • This applies especially to chemical structure diagrams, charts, graphs, and schematics — these are frequently saved in CMYK and become invisible to automated extraction
  • Even black-and-white figures should be embedded as RGB Grayscale rather than Indexed color
  • Minimum resolution: 300 DPI for print quality, 150 DPI acceptable for online-only

3. Running Headers and Footers

Many journals add repeating headers and footers on every page — such as a shortened article title on even pages and “Author et al.” on odd pages.

These should be avoided whenever possible after the first page. Running headers and footers are embedded into the text stream by OCR systems and appear as random lines inserted into the article body. This causes:

  • Author names appearing in the middle of paragraphs
  • Truncated titles appearing inside the reference list
  • Sentences being split by metadata fragments

Recommendations:

  • Preferred: Use only page numbers after the first page — no running headers or footers
  • If running headers are required: Use a clearly different font size (e.g., 8pt vs 10pt body) and place them in the PDF header/footer area using proper PDF tagging (artifact tagging) so extraction tools can identify and skip them
  • Never place running headers in the same text flow as body content
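
When running headers are not tagged as artifacts, extraction tools have to fall back on heuristics. The sketch below shows one such heuristic under simple assumptions: any line that repeats verbatim on a large enough share of pages is treated as a running header or footer. The name `strip_running_lines` and the 0.4 threshold are hypothetical.

```python
from collections import Counter
from typing import List


def strip_running_lines(pages: List[List[str]], min_share: float = 0.4) -> List[List[str]]:
    """Drop lines that repeat verbatim on at least min_share of pages (and at least two)."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page if line.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {line for line, seen in counts.items() if seen >= threshold}
    return [[line for line in page if line.strip() not in repeated] for page in pages]


pages = [
    ["Yilmaz et al.", "Body text A."],
    ["Short Article Title", "Body text B."],
    ["Yilmaz et al.", "Body text C."],
    ["Short Article Title", "Body text D."],
]
print(strip_running_lines(pages))
# [['Body text A.'], ['Body text B.'], ['Body text C.'], ['Body text D.']]
```

A heuristic like this can also delete legitimate repeated lines, or miss headers that include a changing page number, so proper artifact tagging or omitting running headers remains the more reliable option.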

4. Column Layout

The layout of the article text directly affects the accuracy of the automated extraction process.

  • Single Column Recommendation: In every possible case, we strongly recommend using a single-column layout for the article body. Single-column texts are read much more easily and accurately by machines, the text flow remains intact, and HTML/XML conversions are smoother.
  • If Two Columns Are Required: When the journal design makes two columns unavoidable, ensure that the gap between columns is distinct and the reading order (left column first, top to bottom, then the right column) is clear. Even then, multi-column structures increase the risk of conversion errors.
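
To make the column problem concrete, here is a deliberately simple sketch of how text blocks on a two-column page might be put into reading order. It assumes each block is described by its left edge, top edge, and text, and that the column gap sits at the middle of the page; `Block` and `reading_order` are hypothetical names.

```python
from typing import List, Tuple

Block = Tuple[float, float, str]  # (left x, top y, text); y grows downward


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Read the whole left column top to bottom, then the right column."""
    middle = page_width / 2
    left = sorted((b for b in blocks if b[0] < middle), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= middle), key=lambda b: b[1])
    return [text for _, _, text in left + right]


blocks = [
    (320.0, 100.0, "C"),
    (50.0, 100.0, "A"),
    (50.0, 400.0, "B"),
    (320.0, 400.0, "D"),
]
print(reading_order(blocks, page_width=600.0))  # ['A', 'B', 'C', 'D']
```

The sketch breaks as soon as a table, figure, or heading spans both columns or the gap is narrow, which is the kind of ambiguity a single-column layout avoids.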

5. Reference List Format

The reference section must follow consistent formatting rules for reliable extraction.

  • Clear heading: Start the reference section with a distinct heading — “References” or “Bibliography” — using the same heading style as other major sections
  • Consistent entry format: Every reference must begin the same way — either all numbered (1., 2., 3.) or all bulleted. Do not mix formats (e.g., first five references without numbering, rest numbered)
  • DOI on the same line: Keep the DOI link on the same line as the rest of the reference. A DOI split across two lines (with the last digit on the next line) causes extraction errors and broken links
  • Clear section boundaries: If an Appendix, Supplementary Material, Acknowledgements, or Declarations section follows the references, it must begin with its own distinct heading — otherwise the appendix content gets merged into the last reference entry
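
The rules above map directly onto how a reference list can be split into entries. The following is a minimal sketch, assuming numbered entries (1., 2., 3.) and a fixed set of section headings; `split_references` and `SECTION_HEADINGS` are hypothetical names.

```python
import re
from typing import List

ENTRY_START = re.compile(r"^\d+\.\s+")
# Headings that end the reference list (an illustrative, incomplete set).
SECTION_HEADINGS = {"appendix", "supplementary material", "acknowledgements", "declarations"}


def split_references(lines: List[str]) -> List[str]:
    """Group the lines after the 'References' heading into one string per numbered entry."""
    entries: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.lower() in SECTION_HEADINGS:
            break  # a distinct heading ends the reference list
        if ENTRY_START.match(line):
            entries.append(ENTRY_START.sub("", line))
        elif entries:
            entries[-1] += " " + line  # continuation of a wrapped entry
    return entries


lines = [
    "1. Author A. First title. Journal. 2020.",
    "https://doi.org/10.12345/abc.1",
    "2. Author B. Second title. Journal. 2021.",
    "Appendix",
    "Table A1: Extra data.",
]
print(split_references(lines))
```

Without the "Appendix" heading, the line "Table A1: Extra data." would be appended to the second reference. A DOI broken in the middle would likewise be rejoined with a space inside it, which is why it should stay on one line.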

Pre-Publication Checklist

Before finalizing your PDF, verify:

Identification

  • [ ] DOI is present, correctly formatted, and displayed as an unbroken string
  • [ ] DOI link is functional (resolves to the article)
  • [ ] ISSN/e-ISSN is displayed on the first page
  • [ ] Journal name is complete (full name, not just abbreviation)
  • [ ] Volume, issue, and page numbers are present

Authors

  • [ ] All author names are listed in correct order
  • [ ] Each author has superscript affiliation number(s) — using ¹ ² ³ format
  • [ ] All affiliations are listed directly after author names (not in page footer)
  • [ ] Affiliations follow order: Institution, Faculty, Department, City, Country
  • [ ] Corresponding author is marked (*) with name and email clearly stated
  • [ ] ORCID iDs are included for all authors
  • [ ] ROR IDs are included for institutions (when available)

Dates

  • [ ] Received date is present
  • [ ] Accepted date is present
  • [ ] Published/Online date is present
  • [ ] Date format is consistent throughout
  • [ ] Dates are logical (received < accepted < published)

Content Metadata

  • [ ] Article type is clearly indicated (Research Article, Review, Case Report, etc.)
  • [ ] Abstract is present with clear “Abstract” heading
  • [ ] Keywords are listed with consistent separator (comma or semicolon, not mixed)
  • [ ] Abstract and keywords appear before the Introduction section

First Page Rule

  • [ ] First page contains ONLY metadata — no Introduction or body text begins on page 1
  • [ ] All metadata elements listed above are present on the first page

Structural and Visual Layout

  • [ ] Headings and subheadings are in different formats (size, bold, etc.) to show hierarchy
  • [ ] All images, tables, and figures have numbered and descriptive captions
  • [ ] All figures are embedded in RGB color space (not CMYK or Indexed color)
  • [ ] Article body is prepared in a single-column layout if possible
  • [ ] No running headers/footers after page 1 (or properly tagged as artifacts)
  • [ ] Reference list uses consistent entry format throughout
  • [ ] Sections after references (Appendix, Declarations) have distinct headings

Rights & Access

  • [ ] Copyright statement is included
  • [ ] License type is specified as text (not icon only) — e.g., “CC BY 4.0”
  • [ ] License URL is provided
  • [ ] License information appears on the first page

Multilingual (if applicable)

  • [ ] Abstract in article language
  • [ ] Abstract in English
  • [ ] Keywords in both languages
  • [ ] All dates use consistent terminology

Common Mistakes and How to Fix Them

Problem | Impact | Solution
DOI in footer only | Extraction failure | Move to header/first page body
DOI split across lines | Broken DOI, incomplete extraction | Keep entire DOI on a single line
Missing journal name/ISSN | Journal identification fails | Add full journal name and ISSN to first page
Unclear author order | Attribution errors | Use numbered sequence with superscripts
Affiliations in page footer | Affiliations misread or lost | Place affiliations directly after author names in body
Missing ORCID | Author disambiguation fails | Add for all authors
Date format mixing | Parsing errors | Use one format consistently
License as icon only | Automated license detection fails | Write license type as text: “CC BY 4.0”
No license URL | Open access detection fails | Include full Creative Commons URL
ISSN typo | Journal matching fails | Double-check format (XXXX-XXXX)
Abstract without heading | Content extraction fails | Add clear “ABSTRACT” heading
Body text on first page | Metadata/content boundary confusion | Reserve first page for metadata only
Running headers in text flow | Author names/titles appear in body | Remove running headers or tag as artifacts
Figures in CMYK color | Images not extracted by OCR | Convert all figures to RGB before embedding
Mixed reference format | Some references not extracted | Use consistent numbering or bullet format
No heading before Appendix | Appendix content merged into references | Add clear “Appendix” heading after references

Conclusion

Creating well-structured PDFs is an investment in your journal’s digital future. The metadata you include today determines how discoverable, citable, and preservable your content will be for decades to come.

At FullTextCreator, we’ve built our system to handle variations in formatting—but the cleaner your source files, the more accurate and faster the conversion process. By following these guidelines, you’re not just making our job easier; you’re ensuring your authors’ work reaches its full potential audience.

Questions or need help? Visit fulltextcreator.com or contact us at [email protected].


This guide is provided by the FullTextCreator team as a service to the academic publishing community. We welcome feedback and suggestions for improvement.

Last Updated: April 2026
