Are the Same Page Slug With and Without index.html Counted as Duplicate Content?



Chris
Hi all,
I have a client whose website serves every page at two URLs: the standard URL, for example www.domain/news-painting, and a duplicate at www.domain/news-painting/index.html.
What is the best way to deal with these /index.html pages?

Micha
Set canonical link relations in the pages' HEAD sections so the pages define their canonical URLs:
<link rel="canonical" href="[preferred URL]" />
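A minimal sketch of how the [preferred URL] could be derived for each page, assuming the trailing-slash folder URL is the preferred form. The `INDEX_NAMES` list, the function names, and the example.com URLs are illustrative assumptions, not something from the site in question:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default index filenames; adjust to match the server.
INDEX_NAMES = ("index.html", "index.htm", "index.php")

def canonical_url(url: str) -> str:
    """Return the URL with a trailing default index filename removed.

    The query string is kept; any #fragment is dropped.
    """
    parts = urlsplit(url)
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last.lower() in INDEX_NAMES else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

def canonical_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for a page URL."""
    return f'<link rel="canonical" href="{canonical_url(url)}" />'

print(canonical_tag("https://www.example.com/news-painting/index.html"))
```

Both the folder URL and its index.html variant map to the same value, so either version of the page would declare the same canonical URL.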

Chris ✍️ » Micha
Is it best to keep these /index.html pages active as they are and just add <link rel="canonical" href="[preferred URL]" />, instead of deleting the pages and adding a 301 redirect?
Micha » Chris
Are you sure these are separate pages? Most of the time when I see this, the "index.html" page is literally the same page served as the folder URL. The problem emerges because the CMS generates links pointing to the ".html" file.
Be careful about deleting pages if you're not sure what's going on.
Chris ✍️ » Micha
Sorry, yes they are exactly the same pages throughout the whole website. All duplicates with the /index.html
Micha » Chris
So there's nothing to delete.
Technically, you need to determine if the internal linking can be fixed. But even if not, the canonical link relations should help the search engines see what's going on.
Don't try to redirect "/folder/index.html" to "/folder/". That typically creates a redirect loop.
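A toy model of how such a loop can arise, assuming the server maps a folder URL to its index.html internally and then runs the redirect rules again on that internal pass. The `guard` flag, the rule order, and the /news/ path are hypothetical, and this does not reproduce any particular server's behaviour:

```python
MAX_HOPS = 5  # stop following redirects after this many hops

def handle(path: str, internal: bool, guard: bool) -> tuple[str, str]:
    """Run one pass through the rules; return ("redirect", url) or ("ok", file)."""
    # Rule 1 (hypothetical redirect): /folder/index.html -> /folder/
    # With guard=True the rule is skipped on internal passes.
    if path.endswith("/index.html") and not (guard and internal):
        return ("redirect", path[: -len("index.html")])
    # Rule 2 (directory index): a folder URL is served from its index.html
    # through an internal pass that applies the rules again.
    if path.endswith("/"):
        return handle(path + "index.html", internal=True, guard=guard)
    return ("ok", path)

def fetch(path: str, guard: bool) -> list[str]:
    """Follow redirects the way a browser would, recording each hop."""
    hops = []
    for _ in range(MAX_HOPS):
        status, target = handle(path, internal=False, guard=guard)
        hops.append(f"{status} {target}")
        if status == "ok":
            break
        path = target
    return hops

print(fetch("/news/index.html", guard=False))  # never reaches "ok"
print(fetch("/news/index.html", guard=True))   # one redirect, then "ok"
```

In this model the redirect is only safe when it is limited to the URL the client actually requested, which is the condition that decides whether a redirect loops or works.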
Chris ✍️ » Micha
The search engines are fine as they are picking up the non /index.html pages, so is it still best to add canonical link as you said and then resolve this for all future pages? Many Thanks for your help btw.
Micha » Chris
Based on what you've told me above, I would add the canonicalization.


Truslow 🎓
I gave you the answer to this last night, but… I guess you didn't like it?
Canonical isn't the way to go because that still leaves two pages – so links to / and links to index.html will never consolidate to power one page.
You CAN tell Google to track them as one page in analytics but that still doesn't get your link mojo mixing properly. https://support.google.com/analytics/answer/1009675?hl=en
Honestly, I haven't seen a host that didn't automatically resolve these things in nearly a decade. As I said last night, check with the host. If it's a dedicated server – check with whoever runs it. Whatever the default page is (index.htm, index.html, index.php – or even default.htm and the rest on a Windows box) should automatically resolve properly.
If the tech person looks at you strangely then a) get a new tech person or b) https://www.djtechblog.com/SEO/redirect-default-index-html-to-root-SEO-basics/
Redirection won't cause an issue because if the default page is set up properly in Apache or whatever you're using (e.g. https://ubiq.co/tech-blog/how-to-change-default-index-page-in-apache/) then they'll both throw a proper 200 header and redirecting the index.html to the / will just work. Internally you're "rewriting" index.html to be / – and just sending that 301 header to make sure bots and browsers with caches etc. can hit the right page and all the juice flows in the right direction.


Chris ✍️ » Truslow
I didn't dislike the answer, I usually like to get a couple of opinions, many thanks for your help on this one 🙂
Micha » Chris
The question here is, are there 2 pages or just 1 page being referred to differently?
I got the impression from your responses there is only 1 page. I also got the impression that it's already defined as the default index page. If that's true, then there's nothing to fix on the server side – it's an internal linking issue.
The canonical directive is just a prudent measure, but I would definitely try to fix the internal linking. Rewriting the URLs can mask the issue but doesn't fix it.
I'm not saying don't rewrite them. It's 1 command. But if it were my site, I'd rather fix the way the links are generated.
Chris ✍️ » Micha
The website has 90+ pages, and each page has a / version and an index.html version, so they show as duplicate content in SEMrush, and I'm worried about future pages.
Micha » Chris
It sounds like SEMrush is following malformed internal links.
You should verify what the default index extension is in the server configuration. If it's Apache you can set up a fallback list, so basically anything ending in ".php", ".htm", ".html", ".shtml" would be treated as the folder index page. I suspect that's already in place, or the URL rewriting Stockbridge recommends.
But SEMrush is following internal links. To fix that annoyance, you need to fix the internal links (although I think you can tell their crawler to ignore some links – I know someone's crawler has that option).
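A rough sketch of the fallback-list idea: the server tries an ordered list of candidate index filenames and serves the first one that exists in the folder (Apache's DirectoryIndex directive takes such a list). The function name, the folder contents, and the candidate order below are made up for illustration:

```python
from typing import Iterable, Optional

def pick_index(files: Iterable[str], candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate index filename present in the folder, if any."""
    present = set(files)
    for name in candidates:
        if name in present:
            return name
    return None  # no index file: the server would fall back to other behaviour

folder = ["about.html", "index.html", "logo.png"]
print(pick_index(folder, ["index.php", "index.htm", "index.html"]))
```

Whichever file wins, the folder URL and the explicit filename URL then return the same content, which is how the two versions of each page come about.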
Truslow 🎓
The fact that those index.html files aren't redirecting is why I'm saying to check with the server admin. That normally just happens if it's set up right. It's definitely not.
For example, the file actually called to load any WordPress site in the world is "index.php" – and every one that is set up properly just resolves to the root. The screenie is an example of a WP site.
301 is the proper response for the index file – be it PHP or HTML or whatever. They should NOT both be resolving… That said – most modern systems would automatically do that once you set the default in cPanel or whatever. So… whoever set it up didn't do it right.


Micha » Truslow
The "index.php" files in WordPress are not actual Web pages – they're script files. WordPress assembles each page you see on the site dynamically at run time by pulling the components from the database.
So we're talking about 2 different situations here. The "index.html" URLs are typically files on the disk – although some weird CMSs will generate URLs that use the ".html" extension even though everything comes from a database.
So in this specific case, I'm very sure that SEMrush is finding malformed links in the HTML code. Rewriting will only mask the problem – it's not a fix for malformed links in the code.
Chris if you'd like to share a link (even by PM), we'll be able to reach agreement on what your problem is much more quickly if someone can see an actual page.
Or a screen capture from the SEMrush report might tell us if it's actually finding pages like this or just following bad URLs that generate errors.
At this point, we're guessing and we're like 2 blind men trying to describe an elephant.


Micha
Okay, Chris sent me the site's URL.
The site is not published in WordPress. I think the CMS is called Bootstrap? Not sure. Don't have time to check.
The "index.html" links are malformed by the CMS when it generates the HTML code for the pages. The internal links look like this:
<a href="index.html">
If I click on the navigational elements, the site takes me to the correct URL. If I manually add the "index.html" to the end of the URL, I get a soft 404 page. So SEMrush is trying to crawl links to pages that aren't there.
It could be the URL rewrites are misconfigured, and I would look at that, too. But if the cms can generate absolute, fully resolved URLs for the internal links SEMrush will stop trying to crawl the non-existing URLs.
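A short sketch of what a crawler does with a relative link like the one above, using Python's standard `urljoin`. The example.com URLs and the `crawl_targets` helper are assumptions for illustration; the actual site's URLs were not shared in the thread:

```python
from urllib.parse import urljoin

def crawl_targets(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the page URL, as a crawler would."""
    return [urljoin(page_url, href) for href in hrefs]

# A relative href is resolved against the folder of the page it appears on.
print(crawl_targets("https://www.example.com/news-painting/", ["index.html"]))

# An absolute path names the folder URL directly, so no /index.html
# variant is ever requested.
print(crawl_targets("https://www.example.com/news-painting/", ["/news-painting/"]))
```

The first call yields the /news-painting/index.html URL that the crawler then tries to fetch; the second yields only the folder URL.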
I will also guess there are soft 404 errors in Google Search Console. If so, then you've found a problem that needs to be fixed because the internal flow of PageRank is probably broken.
I don't know why the URL rewriting doesn't work as expected. Fixing that will bridge the gaps but the site will still experience a lot of pointless crawl, as every folder request will have to be served twice.
It's probably not going to affect crawl budget, but it would annoy me no end. So I still recommend configuring the CMS to generate proper URLs for the internal links.
