The Current State Of Google Scholar: Everything We’ve Learned

The purpose of this article is to serve as a reference to the current state of Google requirements, specifically as they pertain to scholarly literature. Several of our clients must satisfy guidelines as part of Google Scholar, one of the most widely used free academic search engines. We aim to optimize our clients’ content for search terms through ongoing SEO best practices while making sure we hit all the marks for Google Scholar inclusion as well. And here, we’ll share what we’ve learned.

Part 1. Google Scholar 101: What It Is & How It Works

How does it process content, and how does it compare to Google’s general search engine?

Researchers and scholars use Google Scholar to search, find, and access relevant journal articles. Unlike other databases, its search functionality focuses on individual articles rather than entire journals. Inclusion in Google Scholar can expand an article’s accessibility and reach. Still, an article must meet specific criteria for the search engine to find it and consider it a legitimate source, and the search engine does not index all of the content to which it has access. We cover those inclusion guidelines below.

Like Google, Google Scholar is a crawler-based search engine. It uses spiders that crawl web pages to identify new content. Automated software known as “parsers” identifies bibliographic data and references. Google Scholar identifies content across all academic disciplines, from all countries, in all languages. In addition, it has access to all crawlable scholarly content published online and can use citations in the articles it indexes to find other related content.

What advantages does Google Scholar offer to journals?

This unique search engine can improve the chances that new readers will find, share, and cite individual journal articles. Scholars may freely search for academic journal articles without needing access to subscription-based abstracting and indexing (A&I) databases or prior knowledge of specific journals. They may also download articles for future reference.

Google Scholar does a great job finding multiple versions of scholarly articles and theses, including versions on various publisher sites and in open access journals. When it indexes multiple versions, it treats the full text from the publisher as the primary version.

Part 2. Google Scholar Help: Getting Indexed

Websites included in Google Scholar generally meet these criteria:

To be included in Google Scholar, a website must meet specific criteria. First, the website must consist primarily of scholarly articles, such as original research articles, technical reports, journal papers, conference papers, dissertations, or abstracts. Google Scholar does not consider news or magazine articles, books, book reviews, or editorials appropriate.

Second, the website must be freely available, without requiring human readers or search engine robots to log in, accept disclaimers, dismiss pop-up or interstitial advertisements, or install special software.

Finally, the website must follow all Google Scholar technical guidelines to be indexed. To check whether a journal’s articles are indexed, search for the journal website’s domain at scholar.google.com. Only full-text content, in HTML or PDF, will be indexed. If you find that articles are not being indexed, ensure that all of the following inclusion criteria are met.

How to set up articles for inclusion in Google Scholar:

Publish each article on its own URL
Place each article and each abstract in its own HTML or PDF file
Make the full text freely available to users and crawlers
Export bibliographic data in meta tags

We’ve seen firsthand the importance of publishing each article on its own URL, with each article and each abstract placed in a separate HTML or PDF file.

Also, the journal site, including each abstract, must be available to both users and crawlers. The website must make either the full text of the articles or their complete author-written abstracts freely available when users click on the URLs in Google search results. This content must be visible to users without requiring them to scroll down, click buttons, or dismiss pop-ups.

Lastly, the site must be able to export bibliographic data in HTML meta tags. We’ll go into more detail on proper meta tag configuration a bit later.

If the website is custom-built, and there’s some uncertainty about its ability to support indexing, there is the option to move it to a journal hosting service. Services like Atypon, HighWire, Ingenta, and Silverchair have built-in features that automatically support full-text indexing in Google Scholar.

Google Scholar inclusion guidelines – HTML:

Now, let’s dive into some specific guidelines. We’ve separated these sections by HTML and PDF guidelines, so if you’re only interested in PDF criteria, you can skip down a bit. As for the HTML criteria, it is essential to check that the HTML text is searchable. Each scholarly document or journal article must be in a separate HTML file that is smaller than 5MB.

Each journal article should include the paper’s title (not the journal’s) in a large font at the top of the first page. The authors should be listed right below the title on a separate line. Mark the section of the HTML or PDF document that contains references to other works with a standard heading, such as “References” or “Bibliography,” on a line by itself.

Additionally, journal publishers should contact Google Scholar to request inclusion in the index with the Google Scholar Inclusion Request Form. Once the request is received, Google Scholar’s search robots should find the articles and include them within several weeks.

Configuring the metadata:

Configure the publisher’s software to export bibliographic data in HTML meta tags. To check that these tags are present, visit several abstracts and view their HTML source. Here is what to include in the meta tags:

Title tag = the title of the paper, not the journal or website.
Author tag = the author of the paper, not the website. Put each author name in a separate tag and omit all affiliations and certifications.
Publication date tag = the date of publication that would normally be cited in references to this paper from other papers.

Publishers need to provide at least these three fields. Pages that are missing any one of these three will be processed as if they had no meta tags at all. For a list of all necessary tags, see here.
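
To make the three required fields concrete, here is a minimal sketch that renders them as HTML meta tags. It assumes the Highwire Press-style tag names described in Google Scholar’s inclusion guidelines (citation_title, citation_author, citation_publication_date); the function name, paper title, authors, and date are hypothetical placeholders, so confirm the exact tag set against the guidelines before relying on it.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date):
    """Render the three minimum bibliographic fields as HTML meta tags.

    Assumption: Highwire Press-style tag names. `publication_date` is a
    string such as "2024/03/15" (a bare year is also commonly used).
    """
    tags = [("citation_title", title)]
    # One tag per author; names only, no affiliations or certifications.
    tags += [("citation_author", name) for name in authors]
    tags.append(("citation_publication_date", publication_date))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Hypothetical paper used purely for illustration.
print(scholar_meta_tags(
    "A Hypothetical Study of Example Data",
    ["Doe, Jane", "Roe, Richard"],
    "2024/03/15",
))
```

The output belongs in the head of each article or abstract page, which is where you would look when viewing the HTML source as described above.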

With these tags in place, researchers using Google Scholar are shown the relevant articles matching the metadata. If there is access to the full text, users may begin reading. Otherwise, they will see information from the publisher or rights holder on how to access the material.

A note on meta tags for the abstract: While the abstract needs to be visible to the user, the meta tags are only visible to search robots. Thus, it’s fine to display the abstract as a paragraph of text with a heading that says “Abstract.”

Google Scholar inclusion guidelines – PDF:

And now, onto some specific inclusion guidelines for PDF documents. Place each article in a separate PDF file. If your articles are only in PDF format, they can still be indexed, as long as each file doesn’t exceed 5MB. For larger documents, use the Google Book Search service instead of Google Scholar. Also, use Google Book Search to publish textbooks and monographs. Google Scholar automatically includes scholarly works from Google Book Search.

Be sure that the full text is in a PDF file whose name ends with “.pdf”. Also, ensure that the text is searchable. To do this, open the file in Adobe Acrobat Reader, click “Find,” and confirm that you can search for and find words within the document.
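
The file name and size rules are easy to check in bulk before publishing. The sketch below is one way to do it; the function name and sample file names are hypothetical, and reading “5MB” as 5 * 1024 * 1024 bytes is our assumption. It does not test whether the text is searchable, which still needs the manual check described above.

```python
MAX_PDF_BYTES = 5 * 1024 * 1024  # assumption: "5MB" read as 5 * 1024 * 1024 bytes


def pdf_precheck(file_name, size_bytes):
    """Return a list of problems for one PDF (empty list if none found)."""
    problems = []
    # Strict, case-sensitive check, since the guideline names ".pdf" exactly.
    if not file_name.endswith(".pdf"):
        problems.append("file name does not end with .pdf")
    if size_bytes > MAX_PDF_BYTES:
        problems.append("file is larger than 5MB")
    return problems


# Hypothetical files; in practice the size could come from os.path.getsize().
print(pdf_precheck("article-123.pdf", 2_400_000))  # []
print(pdf_precheck("article-124.PDF", 7_000_000))  # both problems reported
```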

If the paper is only available in PDF format, it is still possible to index the content without meta tags, but the document must follow these conventions:

The title must be set in a font of at least 24 pt or, in an HTML page, placed inside an <h1> or <h2> tag at the top.
The authors must be listed directly below the title, in a slightly smaller font that is still larger than normal text, or in an <h3> tag.
A bibliographic citation to a published version of the paper must appear on a line by itself, inside the header or footer of the first page of the PDF file, e.g., “J. Biol. Chem., vol. 234, no. 8, pp. 1971-1975, August 1959”

For all inclusion guidelines, reference the Inclusion Guidelines for Webmasters.

Setting up publisher profiles:

Publishers with multiple offerings indexed will want to manage their Google Scholar profiles. A publisher’s profile includes a list of publications that can be sorted by date or by the number of times each work has been cited. Publishers may also see which publications have cited them. To set up a profile:

Log in to a Google account.
Go to Google Scholar and click on My Citations.
Follow the instructions, add affiliation information, and verify an email address.
Add keywords relevant to the published research.
Add publications – Google will likely suggest the correct ones and ask you to confirm that they are yours. To find missing publications, you may add them manually or search using article titles.
Make the profile public so scholars can find the articles.

Part 3. Google Scholar Crawl Guidelines

We recommend referencing the crawl guidelines for the technical requirements and possible solutions. Google Scholar’s crawlers need to discover and fetch the URLs of all articles and periodically refresh their content from the website. Google Scholar typically adds new articles several times a week, but if an article’s or website’s meta tags or bibliographic data need updating, it can take six to nine months for the changes to appear in Google Scholar’s search results. Find our summary of several other crawl guidelines below:

Google Scholar recommends that the URL of every article be reachable from the homepage by following at most ten simple HTML links. A straightforward way to achieve this is to list all articles on a single HTML page.
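
One way to audit the ten-link rule is a breadth-first search over the site’s internal links. The sketch below works on an in-memory link map rather than a live crawl; the function names, the map, and the URLs are hypothetical, and in practice the map would come from your own crawl data.

```python
from collections import deque


def link_depths(links, homepage):
    """Breadth-first search: fewest link hops from the homepage to each URL."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def hard_to_reach_articles(links, homepage, articles, max_links=10):
    """Return articles that are unreachable or more than max_links hops away."""
    depths = link_depths(links, homepage)
    return [a for a in articles if depths.get(a, max_links + 1) > max_links]


# Hypothetical site structure: page -> pages it links to.
site = {
    "/": ["/issues"],
    "/issues": ["/issues/1"],
    "/issues/1": ["/articles/a", "/articles/b"],
}
articles = ["/articles/a", "/articles/b", "/articles/orphan"]
print(hard_to_reach_articles(site, "/", articles))  # ['/articles/orphan']
```

Breadth-first search fits here because it visits pages in order of distance from the homepage, so the first time it reaches an article is also the shortest path to it.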

Since the articles need to be available to users and crawlers, there are some redirect guidelines. If you need to move your articles to new URLs, set up HTTP 301 redirects from the old location of each article to its new location. Do not redirect article URLs to the homepage – users need to see at least the abstract when they click on your URL in Google results.
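
Before a migration, it can help to scan the planned redirect map for entries that send an article to the homepage. This is a minimal sketch under the assumption that the map is available as old URL to new URL pairs; the function name and URLs are hypothetical, and it does not verify that the server actually answers with an HTTP 301.

```python
from urllib.parse import urlsplit


def redirects_to_homepage(redirect_map):
    """Return the old article URLs whose redirect target is the site root."""
    flagged = []
    for old_url, new_url in redirect_map.items():
        # A root target has an empty path or just "/".
        if urlsplit(new_url).path in ("", "/"):
            flagged.append(old_url)
    return flagged


# Hypothetical redirect plan: old location -> new location.
plan = {
    "https://www.example.com/old/a1": "https://www.example.com/articles/a1",
    "https://www.example.com/old/a2": "https://www.example.com/",
}
print(redirects_to_homepage(plan))  # ['https://www.example.com/old/a2']
```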

If the website uses a robots.txt file, e.g., www.example.com/robots.txt, it must not block Google’s search robots from accessing articles or browsing URLs.
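
Python’s standard library includes a robots.txt parser, which makes for a quick sanity check. The sketch below assumes the relevant crawler identifies itself as “Googlebot” and uses a hypothetical rule set and URLs. The standard-library parser may not interpret every rule exactly as Google does, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that (mistakenly) blocks the article directory.
robots_txt = """\
User-agent: *
Disallow: /articles/
"""
urls = [
    "https://www.example.com/articles/a1",
    "https://www.example.com/about",
]
print(blocked_urls(robots_txt, urls))  # ['https://www.example.com/articles/a1']
```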

Part 4. Some Ranking Insights & Tips

Google Scholar algorithms extract bibliographic information and citations from articles and use this information for ranking. The number of citations to a particular article helps to determine its rank within Google Scholar search results. Grouping versions allows Google Scholar to collect all citations to all versions of a work, which can significantly improve the position of an article in search results. Providing relevant metadata about your articles can also help increase the likelihood of identifying all citations to your articles.

Keep in mind that Google users must see at least the complete abstract or the first full page of each article, which may pose a challenge. Please keep reading to discover several recommendations we have made to help clients overcome this challenge.

Part 5. Greenlane’s Findings and Observations

The Challenge

A premier peer-reviewed medical journal site wanted more inclusion in Google Scholar but did not permit unrestricted access to all of their content.

The Goal

We aimed to dive deeper into subscription and paywalled content guidelines for the client so that their helpful content may still be discoverable within search results for scholars.

Observations

One question was: “How does Google know when things are coming off being paywalled?”

Greenlane shared that Google would discover this as the page is recrawled. There is not a specific signal, markup, etc., we can give in this case. While there isn’t a guarantee, resubmitting those pages in a new XML sitemap could allow Google to discover the change faster.
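
As an illustration of that resubmission step, here is a minimal sketch that builds a sitemap for pages that recently came out from behind the paywall, following the sitemaps.org protocol. The function name, URLs, and dates are hypothetical, and as noted above, submitting a sitemap does not guarantee a faster recrawl.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified) pairs.

    `last_modified` is a date string such as "2024-03-15".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # lastmod signals when the page last changed (e.g. paywall lifted).
        ET.SubElement(node, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages that recently became freely accessible.
print(build_sitemap([
    ("https://www.example.com/articles/a1", "2024-03-15"),
    ("https://www.example.com/articles/a2", "2024-03-16"),
]))
```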

Google recommends indicating paywalled or restricted content to Google via structured data. Google’s developer documentation contains guidelines on implementing paywall structured data. This structured data is used in conjunction with flexible sampling: a CSS class applied to the paywalled section of the page is referenced from within the structured data.
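
For reference, the paywall markup is a small JSON-LD object. The sketch below generates one using the schema.org properties we understand Google’s documentation to describe (isAccessibleForFree, hasPart, cssSelector); the function name, the “Article” type, the headline, and the “.paywall” class are placeholders we chose for illustration, so check the current documentation for the supported types.

```python
import json


def paywall_json_ld(headline, paywalled_css_selector):
    """Return a JSON-LD script tag marking part of a page as paywalled."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",  # placeholder type; confirm against Google's docs
        "headline": headline,
        "isAccessibleForFree": False,
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            # CSS selector for the element wrapping the restricted content.
            "cssSelector": paywalled_css_selector,
        },
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical article and class name.
print(paywall_json_ld("A Hypothetical Study of Example Data", ".paywall"))
```

The selector must match the class that is actually present in the page’s HTML, for example a div with class “paywall” wrapping the restricted text.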

Flexible sampling is a method where some amount of the article content is available above the “fold,” and the content after the fold is obscured. Google describes this cutoff as ideally “a few sentences,” though in practice, one should show enough content above the fold to give search engines a clear understanding of the page’s content (SEO value and relevant keywords) while still enticing readers to subscribe. Google envisions the use of flexible sampling in conjunction with metered free content as a best practice. In this way, after a user exhausts their quota of free content, flexible sampling takes over until the meter resets. You can read more about flexible sampling here.
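
The interplay between the meter and the sample can be sketched in a few lines. Everything here is hypothetical: the function name, the quota of three free articles, and the two-paragraph sample are placeholder values, not recommendations from Google.

```python
def content_to_serve(paragraphs, articles_read_this_period,
                     free_quota=3, sample_paragraphs=2):
    """Decide how much of an article a non-subscriber sees.

    While the reader is under the metered quota, serve the full article.
    Once the quota is used up, serve only the opening sample until the
    meter resets.
    """
    if articles_read_this_period < free_quota:
        return paragraphs
    return paragraphs[:sample_paragraphs]


article = ["Intro.", "Methods.", "Results.", "Discussion."]
print(content_to_serve(article, articles_read_this_period=1))  # full article
print(content_to_serve(article, articles_read_this_period=3))  # ['Intro.', 'Methods.']
```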

Are you looking to discuss these concepts in more detail? If so, contact us today to learn more about how Greenlane can help your business.